Executive Summary

Currently, an air operation center (AOC) for a major regional conflict is composed of more than 400 personnel,
hundreds of computer servers, and an extensive communication infrastructure. In addition to the goal of
achieving a faster, more nearly real-time response, there is therefore also a desire to reduce the manning and equipment associated
with the endeavor. This means that the management of redundancy must be optimized. An important consideration
in the design and fielding of such systems is their capacity to accommodate faults through control reconfiguration using
either direct or analytic redundancy. The latter exploits the inherent dynamic and static relationships
among the system variables, and has the advantage of making the most efficient use of the components. The proposed research
seeks to apply the concepts of control reconfigurability to C2 systems modeled as stochastic discrete event systems.

This  report  contains  ten  self-contained  sections  drawn  from  eight  fully  refereed  conference  papers  and  two  reports 
produced  over  the  past  four  years.  The  sections  are  organized  into  three  chapters:  Reconfigurability  Concepts 
and  C2-AO  Relationship,  Feedback  and  Reliability  of  Wireless  Sensor  Networks  in  C2,  and  Supervisory  Control 
and  Performance  Analysis  of  C2  Database.  This  effort  has  produced  three  MSEE  Theses  (Metzler,  Kussard,  and 
Ruschmann),  one  additional  conference  paper  under  review,  one  accepted  journal  paper,  and  one  invited  journal 
paper  to  be  submitted  within  this  month. 
1. Strategic Reconfigurability in Air Operations
1.1.  Problem  description 

In military air operations, the delivery speed and
accuracy of modern weapons make swift decisions desirable,
while advanced sensing and processing capabilities
make them possible. These traits have been well recognized,
and have directed the focus of some research
activities onto closing the feedback control loops in
air operations, despite many differing philosophies in
modeling [12], [21], [11] and techniques for analysis
and control [10], [17], [41]. Due to large uncertainties
and computational burdens, however, it is difficult for
feedback designs to be proactive [11], which requires
accurate prediction far enough into the future. Though
one can establish attrition models in the form of Markov
chains from first principles [21], such models provide
reliable predictions only for a general population of
air operations. They capture little of the dynamic effect of
an individual tactical operation on the execution of a
strategic plan.

The first section of this report is intended to address
the issue of resilience in air operations. The notion of
strategic reconfigurability is introduced to measure
the ability of a military air operation to respond
successfully to contingencies. The intended goal requires
a new modeling framework that is geared toward the
closed-loop control of the air operation, and allows
closer scrutiny of how a tactical execution affects the
evolution of a strategic plan. This section proposes to
model an air operation as a two-level, two-timescale
feedback system. In this framework, coverage of state
transitions in a stochastic discrete-state model with a
relatively low resolution serves to effect the supervisory
control of the strategic operation, while the remaining
detail in the air operation is represented by a continuous-state
model with bounded disturbances. The former
will be called a strategic model, and the latter will
be called a tactical model. The overall model can be
regarded as a hybrid dynamic system. Unlike most
existing efforts in hybrid system modeling [2], however,
our interest is in capturing the interactions between
the strategic and the tactical models. In particular,
strategic states enter the tactical model as symbolic
parameters, while the dynamic effects of the tactical
operation on the strategic model are represented by
a set of state transition coverage parameters. The two-level
framework makes it possible to describe a complex
air operation with a model of much reduced complexity,
provided that the two sets of parameters adequately project
the collective effects of the two models on each other.

Among many possible benefits in using the two-level
air operation model, this section focuses on making
use of the separability of the two-level model in the
investigation of strategic reconfigurability. The notion
of strategic reconfigurability pertains specifically
to transitions of strategic states that are controllable by
Blue. The transition coverage allows our study of
reconfigurability to include the dynamics associated with
information acquisition and processing, and the risks
associated with control policy making. Our investigation
can be confined to the strategic level, with the effects
of tactical operation observable through a set of state
transition coverage parameters.

The section is organized as follows. Subsection 1.2
describes the two-level framework and the role of
transition coverage in it. Subsection 1.3 defines the notions
of specific and strategic reconfigurability, and explains
their utility. A summary is presented in Subsection 1.4.

1.2.  Coverage  of  transitions  in  a  strategic  model 

A strategic model refers to the mathematical description
of the evolution of a strategic plan in an
air operation. It takes the form of a discrete state
and continuous (discrete) time Markov process (chain)
[20]. The fundamental assumption of a Markov process
is that the probability that a system will undergo a
transition from one state to another state depends
only on the current state of the system and not on any
previous states the system may have experienced. To
specify a Markov model, it is necessary to identify (i) a
state space $X$, (ii) a set of initial state probabilities
$\{P_x(0), x \in X\}$, and (iii) a set of state transition rates
$\{\lambda_{x,x'}(t), x, x' \in X\}$ from the current state $x$ to the next
state $x'$ [8]. A Markov process is said to be homogeneous
if all state transition rates are independent of time.
Strategic planning for an air operation can be regarded
as the process of determining both the state space and
the state transition rates of a Markov process under
some constraints which include limited resources, laws
of physics, etc.


Figure  1.1  A  low  resolution  strategic  model 

A transition coverage associated with a transition from
$x$ to $x'$ is the conditional probability that the intended
transition in fact occurs given that a triggering event
has arrived. It is denoted by $c^u_{x,x'}(t)$, where $u$ indicates
its dependence on the control policy used in
the tactical operation. It provides a means to separate
the handling of an event from the occurrence of the
event. Figure 1.1 shows a sample strategic model at
a very low resolution in the form of a rate diagram
of a Markov process. The simple model is chosen to
highlight our approach. In practice, the resolution is
determined by our ability and willingness to deal with
complexity, and by the resolution of the information
available to us. It can be seen that a transition coverage
serves to effectively modify the corresponding transition
rate via $\lambda_{x,x'} c^u_{x,x'}(t)$. This modification is justified by
a decomposition property of a Poisson process [22],
where the arrival of a triggering event may lead to
two (multiple) transitions with conditional probabilities
which we call transition coverage and complementary
coverage, respectively.

The  state  space  of  the  strategic  model  in  Figure 
1.1  is  determined  by  the  composition  of  four  binary 
states:  (Blue  threatened,  Blue  defeated,  Red  targeted, 
Red  defeated).  A  state  of  (True,  False,  True,  False)  can 
be  represented  by  x  =  1010  in  binary.  The  meanings 
of  the  remaining  states  can  be  similarly  explained.  The 
possible  states  that  do  not  show  up  in  the  diagram 
are  ones  that  have  identically  zero  state  probabilities. 
The  links  that  are  missing  between  a  pair  of  states 
correspond  to  transitions  that  have  zero  transition  rates. 
States  0000,  1000,  and  1010  in  Figure  1.1  are  transient 
states  and  states  0100,  0101,  and  0001  are  absorbing 
states. Depending on whether preserving the Blue assets,
in addition to destroying the Red assets, is also part of the
mission of an air operation, the set of desirable outcomes
for Blue is either {0101, 0001} or {0001}.
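The state labeling above can be sketched in a few lines of Python. This is only an illustration of the encoding described in the text; the function names `encode` and `decode` are hypothetical and do not come from the report.

```python
# Order of the four binary flags, as listed in the text.
FLAGS = ("Blue threatened", "Blue defeated", "Red targeted", "Red defeated")


def encode(flags):
    """Map four booleans to the bit-string label and its decimal index."""
    bits = "".join("1" if flag else "0" for flag in flags)
    return bits, int(bits, 2)


def decode(index):
    """Map a decimal state index back to the named binary flags."""
    bits = format(index, "04b")
    return {name: bit == "1" for name, bit in zip(FLAGS, bits)}


# (True, False, True, False) is the state written as 1010 in the text.
print(encode((True, False, True, False)))
print(decode(1))
```

The decimal indices produced this way (0, 1, 4, 5, 8, 10) are the ones used later when the states are ordered in the transition rate matrix.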

The  discussion  on  modeling  in  this  section  is  focused 
on  transition  coverage  rather  than  transition  rates. 
Therefore  it  is  assumed  that  state  transition  rates 
have been determined for the model. The reader
interested in how transition rates in air
operations are determined is referred to the two sample
models [21], [41] in Table S1, which also
shows the differences between the two models. It is easy
to  determine  which  transitions  are  controllable  by 
Blue.  The  controllable  transitions  are  those  having  a 
transition  coverage  attached  to  them.  When  a  strategic 
plan  is  fully  carried  out,  all  values  of  transition 
coverage  associated  with  controllable  transitions  are 
set  to  1  identically.  When  a  planned  action  associated 
with  a  transition  is  totally  ignored,  the  value  of  the 
corresponding  transition  coverage  is  set  to  0.  In  this 
regard,  transition  coverage  serves  as  a  supervisory 
control  agent  at  the  strategic  level,  which  is  dictated 
by  the  extent  of  success  in  carrying  out  the  tactical 
execution. 


Reference | Jelinek [21] | Wohletz et al. [41]
Discrete states | Quantitative numbers of live units making up the Blue and Red assets, each with a certain amount of remaining ammunition | Qualitative symbols formed from parallel composition of smaller sets of states, such as that from a Blue aircraft (base, ingress, egress, dead), a Red target (known, unknown, dead), etc.
Transition rates | Derived from first principles in terms of the number of surviving units, number of targets in an area, time to search for a target, time to aim weapon at it, weapon lethality, etc. | Derived from Poisson clock structures that define the arrivals of triggering events, such as (launch, threat engage, target engage, land) for aircraft, etc., with specified interarrival mean times
Model utility | Outcome prediction of air operations | Decisions in and control of air operations

Table S1 Two sample Markov models


We  raise  the  following  questions.  What  constitutes 
an  appropriate  measure  for  the  effect  of  a  transition 
coverage  on  the  overall  air  operation?  What  factors 
determine  the  effectiveness  of  transition  coverage  in  a 
controllable transition? An answer to the first question
is attempted in the next subsection, and a qualitative
answer to the second question is given in the
current subsection.

One  fundamental  assumption  in  strategic  planning  for 
air  operations  that  involve  feedback  is  the  knowledge  of 
the  current  strategic  state.  This  information  is  often 
deficient  in  a  real  air  operation.  Therefore,  risks  exist 
in  executing  a  control  policy  that  attempts  to  drive  the 
air  operation  into  a  desirable  next  strategic  state.  These 
risks  can  now  be  properly  represented  using  transition 
coverage  in  the  strategic  model  so  that  their  effects  can 
be studied. Such a study is expected to help modify
the  strategic  plan  and  determine  control  policies  that 
minimize  certain  risks. 

[Figure 1.2 block labels: Strategic model; Tactical model]

Figure 1.2 A two-level model of an air operation

Our answer to the second question begins with the description
of a tactical operation model. A tactical model
refers to a closed-loop control system which governs
the execution of an air operation. A tactical model
is a continuous-state dynamic process with bounded
disturbances. Figure 1.2 shows how it is related to
an (overly simplified) strategic model represented by a
single transition. Note that the tactical states do not
include those of the opposing force (Red). With the
use of a smaller timescale than that of the strategic
model, a tactical model is able to describe the behavior
of Blue in greater detail. The tactical state vector may
contain among its components, for example, the strengths
of Blue assets, the rates of change of the strengths,
their geographic locations, rates of change of locations,
etc. In this fashion, a less likely strategic state that
enters the tactical model as a deterministic symbolic
or numerical parameter will be regarded as equally
important in the tactical operation once it has arrived.
The air operation at the tactical level is responsible for
generating two sets of signals. One is an estimate $\hat{x}$ of
the strategic state, and the other is a corresponding
control $u_{\hat{x}}$ that drives the tactical state $\xi$ to wherever is
desirable. Suppose that, with respect to each strategic state $x$,
a prescribed control performance threshold $T_x$ is set
such that the attained control performance $J^u_x$ is required
to exceed $T_x$. The transition coverage from $x$ to $x'$ can
now be formalized as

$$c^u_{x,x'}(t) = \sum_{\hat{x}' \in \Omega_u} p(\hat{x}'|x')(t), \qquad \Omega_u = \{\hat{x} \in X \mid J^u_{\hat{x}} > T_{\hat{x}}\}, \qquad (1)$$

where $p(\hat{x}'|x')$ is the probability of estimate $\hat{x}'$ parameterized
with respect to $x'$. For simplicity the estimate
is assumed to be unbiased and consistent [7].

The  models  in  the  two  levels  together  form  a  system 
which  is  hybrid  not  only  in  terms  of  discrete  state  versus 
continuous  state  descriptions,  but  also  in  terms  of  slower 
versus  faster  dynamics. 

Equation (1) indicates that $c^u_{x,x'}$ is functionally related to the
required tactical-level performance $T_{x'}$, the tactical
performance $J^u_{x'}$ achieved by using $u$, and the uncertainty
description associated with the estimate $\hat{x}'$ of strategic
state $x'$. Transition coverage therefore summarizes the
dynamic response of a tactical operation.


Figure  1.3  A  typical  temporal  profile  of  a  transition 
coverage 


One aspect of transition coverage that has not been
explicitly discussed is its dependence on time. The
time dependence of coverage is mainly attributed to
the time dependence of the conditional distribution of
the estimate of a strategic state, as shown in Equation (1). We
make a loose argument that as information is acquired
and processed, the distribution of $\hat{x}'|x'$ becomes more
specific. Assume, for simplicity, that $T_{x'} = T$, $\forall x'$, and
that the mean of the estimate of $x'$ is in the set $\Omega_u$ of Equation (1), i.e.,

$$E\{\hat{x}'|x'\}(t) = \sum_{\hat{x}'} \hat{x}'\, p(\hat{x}'|x')(t) = x' \in \Omega_u.$$

If increasing $t$ eventually reduces the spread of
$p(\hat{x}'|x')(t)$ to the extent that $p(\hat{x}'|x')(\infty) = \delta(\hat{x}' - x')$,
then $c^u_{x,x'}(t) \to 1$ as $t \to \infty$. This argument is intended to signify
the general trend of $c^u_{x,x'}(t)$ as $t$ increases. In particular,
transition coverage is expressible by a three-parameter
representation $A(1 - \Delta e^{-t/\tau})$, as shown in Figure 1.3,
where $0 < A < 1$ defines the final value of a transition
coverage, $0 < \Delta < 1$ defines the total increment of the
transition coverage over time, and $\tau > 0$ defines the rate
of increase of the transition coverage. This expression is
used to capture both the steady state and the transient
tactical performance. In fact, $c^u_{x,x'}(t)$ can be obtained
in real time by using Equation (1) where, for example, the first
and second moments of the probability distribution
$p(\hat{x}'|x')(t)$ are estimated from the samples collected.
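As a small illustration of the three-parameter profile, the sketch below evaluates a coverage curve of the form final value times (1 minus increment times exp(-t / time constant)). The function name `coverage` and its argument names are hypothetical; the numerical values 0.95, 0.5, and 6 are the ones that appear in the third row of Table S2.

```python
import math


def coverage(t, final, increment, tau):
    """Three-parameter coverage profile: final * (1 - increment * exp(-t / tau))."""
    return final * (1.0 - increment * math.exp(-t / tau))


# Coverage starts at final * (1 - increment) and rises toward `final`.
for hours in (0, 6, 24, 100):
    print(hours, round(coverage(hours, 0.95, 0.5, 6.0), 4))
```

With these values the coverage starts at 0.475 and approaches 0.95, which makes the roles of the three parameters (steady-state level, size of the transient, and speed of the transient) easy to see.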

The simple example in Figure 1.1 is used to observe
some rather dramatic effects of the level and the
dynamics of coverage on winning probabilities at 100
hours into the air operation. The following data are
used in producing the results in Table S2: $\lambda_{0,8} = 0.2$,
$\lambda_{8,0} = 0.02$, $\lambda_{8,4} = 0.04$, $\lambda_{8,5} = 0.001$, $\lambda_{8,10} = 0.4$,
$\lambda_{10,4} = 0.005$, $\lambda_{10,1} = 0.05$.


$c_{8,10}(t)$, $c_{10,1}(t)$ | Blue wins | Red wins | Both lose
1, 1 | 0.82 | 0.18 | 0.00
0.5, 0.5 | 0.21 | 0.58 | 0.21
$0.95(1 - 0.5e^{-t/6})$, $0.95(1 - 0.5e^{-t/10})$ | 0.56 | 0.35 | 0.09

Table S2 Winning probabilities at t = 100 hours when
transient state probabilities have died out.
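For the two constant-coverage rows of Table S2, the limiting outcome probabilities can be checked with a short absorbing-chain calculation. The sketch below is an illustration under stated assumptions, not the computation used in the report: it takes the listed transition rates, assumes (as read from the transition rate matrix given in Subsection 1.3) that the complementary coverage out of state 1000 is routed to 0100 and that out of state 1010 to 0101, and solves for absorption probabilities instead of integrating to t = 100. The function name `absorption_probabilities` is hypothetical.

```python
import numpy as np

# Transition rates listed in the text for the model of Figure 1.1.
LAM = {(0, 8): 0.2, (8, 0): 0.02, (8, 4): 0.04, (8, 5): 0.001,
       (8, 10): 0.4, (10, 4): 0.005, (10, 1): 0.05}
TRANSIENT = [0, 8, 10]
ABSORBING = [1, 4, 5]  # Blue wins, Red wins, both lose


def absorption_probabilities(c_8_10, c_10_1):
    """Rows: start state in TRANSIENT; columns: end state in ABSORBING."""
    flows = {
        (0, 8): LAM[0, 8],
        (8, 0): LAM[8, 0],
        (8, 5): LAM[8, 5],
        (8, 10): LAM[8, 10] * c_8_10,                       # covered transition
        (8, 4): LAM[8, 4] + LAM[8, 10] * (1.0 - c_8_10),    # plus complementary part
        (10, 4): LAM[10, 4],
        (10, 1): LAM[10, 1] * c_10_1,                       # covered transition
        (10, 5): LAM[10, 1] * (1.0 - c_10_1),               # complementary part
    }
    q_tt = np.zeros((3, 3))
    q_ta = np.zeros((3, 3))
    for (src, dst), rate in flows.items():
        i = TRANSIENT.index(src)
        q_tt[i, i] -= rate
        if dst in TRANSIENT:
            q_tt[i, TRANSIENT.index(dst)] += rate
        else:
            q_ta[i, ABSORBING.index(dst)] += rate
    # Absorption probabilities of a continuous-time chain: (-Q_TT)^{-1} Q_TA.
    return np.linalg.solve(-q_tt, q_ta)


for c in (1.0, 0.5):
    from_start = absorption_probabilities(c, c)[TRANSIENT.index(0)]
    print(c, np.round(from_start, 3))
```

Starting from state 0000, this gives Blue-win probabilities of about 0.82 and 0.21 for coverage values of 1 and 0.5, with the other two columns within about 0.01 of the tabulated values. The third row involves time-varying coverage and is not covered by this constant-rate shortcut.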


1.3.  Strategic  reconfigurability 

The analysis of the previous subsection has shown
that transition coverage $c^u_{x,x'}(t)$ can serve to effect
supervisory control of an air operation. In particular,
a Markov decision problem [8] can be formulated for
which an optimal supervisory control policy can be
sought from an admissible set $\{c^u_{x,x'}(t)\}$. A trivial
example would be the greedy policy $\max_{u \in U} c^u_{x,x'}(t)$
with cost functional $1 - c^u_{x,x'}(t)$, where $U$ is the set
of admissible tactical control laws. $U$ can be finite,
countably infinite, or uncountable.

The purpose here, however, is not to solve a Markov
decision problem, but to define a measure of the ability
of an air operation to respond favorably to contingencies.
From the view of a discrete event dynamic
system, a contingent condition means the arrival of one
of multiple probable triggering events that generally has
an unfavorable effect on Blue. Referring to Figure 1.1,
it corresponds to a situation where multiple transitions
out of a current state are probable. Therefore, both
strategic planning and tactical execution must be designed
to be capable of dealing with contingencies. To
quantify such a capability, a measure called strategic
reconfigurability is introduced.

The notion of control reconfigurability has been
used for fault-tolerant control systems [52] for the
purpose of measuring the ability of a process to allow
performance restoration via control reconfiguration
in the presence of faults. Control reconfigurability is
the worst-case minimal system Hankel singular value
for a specific set of faulty conditions. Therefore, it
involves both controllability and observability of the
controlled process. The term strategic reconfigurability
is adopted based on the rationale that the measure has
indeed a close relationship to both the controllability
and the observability of an air operation. First, only
controllable transitions are involved in the strategic
reconfigurability measure. In addition, it is only through
the transition coverage that control actions can be exerted
on the strategic operation, while a low transition
coverage is largely attributed to a lack of observability
of the strategic state involved. Despite the similarity in
terminology, strategic reconfigurability as defined in this
section is an entirely different quantity from control
reconfigurability.

For a given strategic model, the set of state transition
probabilities can be solved from the forward
Kolmogorov equation

$$\dot{P}(t) = P(t)Q(t), \qquad P(0) = I. \qquad (2)$$

Therefore, the transition rate matrix $Q(t)$ completely determines
the set of transition probabilities. Let us
revisit the strategic model of Figure 1.1 to obtain
an explicit $Q(t)$. Ordering the states from absorbing
to transient as (0001, 0100, 0101, 0000, 1000, 1010), or
(1, 4, 5, 0, 8, 10) in decimal, for the strategic model, the
$6 \times 6$ transition rate matrix $Q(t)$ can be found to be


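$$Q(t) = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_{0,8} & \lambda_{0,8} & 0 \\
0 & \lambda_{8,10}(1 - c_{8,10}(t)) + \lambda_{8,4} & \lambda_{8,5} & \lambda_{8,0} & -\lambda_{8,4} - \lambda_{8,0} - \lambda_{8,5} - \lambda_{8,10} & \lambda_{8,10} c_{8,10}(t) \\
\lambda_{10,1} c_{10,1}(t) & \lambda_{10,4} & \lambda_{10,1}(1 - c_{10,1}(t)) & 0 & 0 & -\lambda_{10,1} - \lambda_{10,4}
\end{bmatrix}$$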
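A minimal numerical sketch of how the forward Kolmogorov equation might be integrated for this model is given below. It is an illustration only: it propagates a single row of P(t) (the one for the initial state 0000) with a classical fourth-order Runge-Kutta step, uses the transition rates and the time-varying coverage profiles quoted for the third row of Table S2, and the helper names (`coverage`, `rate_matrix`, `propagate`) and the step size are assumptions rather than anything specified in the report.

```python
import math
import numpy as np

# Canonical state order from the text: absorbing states first.
ORDER = [1, 4, 5, 0, 8, 10]
IDX = {state: i for i, state in enumerate(ORDER)}

# Transition rates quoted in the text for the model of Figure 1.1.
LAM = {(0, 8): 0.2, (8, 0): 0.02, (8, 4): 0.04, (8, 5): 0.001,
       (8, 10): 0.4, (10, 4): 0.005, (10, 1): 0.05}


def coverage(t, final, increment, tau):
    """Three-parameter coverage profile final * (1 - increment * exp(-t / tau))."""
    return final * (1.0 - increment * math.exp(-t / tau))


def rate_matrix(c_8_10, c_10_1):
    """Return Q with rows as from-states and columns as to-states."""
    q = np.zeros((6, 6))

    def add(src, dst, rate):
        q[IDX[src], IDX[dst]] += rate
        q[IDX[src], IDX[src]] -= rate

    add(0, 8, LAM[0, 8])
    add(8, 0, LAM[8, 0])
    add(8, 5, LAM[8, 5])
    add(8, 10, LAM[8, 10] * c_8_10)                     # covered transition
    add(8, 4, LAM[8, 4] + LAM[8, 10] * (1.0 - c_8_10))  # plus complementary coverage
    add(10, 4, LAM[10, 4])
    add(10, 1, LAM[10, 1] * c_10_1)                     # covered transition
    add(10, 5, LAM[10, 1] * (1.0 - c_10_1))             # complementary coverage
    return q


def propagate(c_8_10_fn, c_10_1_fn, t_end=100.0, dt=0.05):
    """Integrate dp/dt = p Q(t) with classical RK4, starting in state 0000."""
    p = np.zeros(6)
    p[IDX[0]] = 1.0

    def deriv(t, p_row):
        return p_row @ rate_matrix(c_8_10_fn(t), c_10_1_fn(t))

    for step in range(int(round(t_end / dt))):
        t = step * dt
        k1 = deriv(t, p)
        k2 = deriv(t + dt / 2, p + dt / 2 * k1)
        k3 = deriv(t + dt / 2, p + dt / 2 * k2)
        k4 = deriv(t + dt, p + dt * k3)
        p = p + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return p


p100 = propagate(lambda t: coverage(t, 0.95, 0.5, 6.0),
                 lambda t: coverage(t, 0.95, 0.5, 10.0))
print(dict(zip(ORDER, np.round(p100, 3))))
```

The first entry of the printed result is the probability of being in the Blue-wins state 0001 at t = 100 hours. A fixed-step integrator is adequate here because the rates differ by at most a few orders of magnitude; a stiff solver would be the safer choice if the disparity among rates were larger.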


A  canonical  form  for  a  strategic  model  will  be  used 
in  the  formal  definition  of  strategic  reconfigurability. 


The  steps  below  can  be  followed  to  modify  a  given 
strategic  state  space  to  a  canonical  form.  It  can  be 
checked  that  our  sample  strategic  model  has  been  put 
into  the  canonical  form. 

1) For a given strategic model, decompose the state
space into the set of transient states and the set
of absorbing states, i.e., $X = X_t \cup X_a$.

2) For the set of transient states, identify the subset
of controllable states from each of which at least
one controllable transition stems, and denote the
subset by $X_t^c$.

3) Further decompose the set of absorbing states into
the subset of desirable absorbing states and the
subset of undesirable absorbing states, i.e., $X_a =
X_a^d \cup X_a^u$. The set of desirable absorbing states
and the set of undesirable absorbing states are
dictated by the mission of the air operation.

4) Finally, reorganize the state space by aggregating
all desirable absorbing states into one winning
state $x_w$ for Blue. Without loss of generality, the
aggregated winning state is placed in the first
entry of the strategic state vector.
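The four steps above can be sketched as a small routine. This is a hypothetical illustration, not the report's procedure: it assumes the model is given as a dictionary of transitions flagged as controllable or not, treats every state with no outgoing transition as absorbing and every other state as transient (which holds for the model of Figure 1.1 but not for Markov chains in general), and uses the placeholder label "w" for the aggregated winning state.

```python
def canonical_form(transitions, desirable):
    """Reorder a strategic state space following steps 1) to 4).

    transitions: dict mapping (source, destination) to True if the
                 transition carries a transition coverage (controllable).
    desirable:   set of absorbing states that count as a win for Blue.
    """
    states = sorted({state for pair in transitions for state in pair})
    sources = {src for src, _dst in transitions}

    # Step 1: split into transient and absorbing states.
    transient = [s for s in states if s in sources]
    absorbing = [s for s in states if s not in sources]

    # Step 2: transient states with at least one controllable transition.
    controllable = sorted({src for (src, _dst), flag in transitions.items() if flag})

    # Step 3: split absorbing states by mission outcome.
    winning = [s for s in absorbing if s in desirable]
    losing = [s for s in absorbing if s not in desirable]

    # Step 4: aggregate the desirable states into one state placed first.
    order = ["w"] + losing + transient
    return {"order": order, "winning": winning, "controllable": controllable}


# Transitions of the Figure 1.1 model; True marks those with a coverage attached.
figure_1_1 = {(0, 8): False, (8, 0): False, (8, 4): False, (8, 5): False,
              (8, 10): True, (10, 4): False, (10, 1): True, (10, 5): False}
print(canonical_form(figure_1_1, desirable={1}))
```

With state 0001 as the only desirable outcome, the resulting order corresponds to (1, 4, 5, 0, 8, 10), the ordering used for the sample model in the text.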

Definition 1. Specific reconfigurability for $x \in X_t^c$ is
given by

$$\rho_x = [\,0 \ \cdots \ 0 \ 1 \ 0 \ \cdots\,]\; P(\infty) \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad (3)$$


where  the  only  entry  in  the  row  vector  on  the  left 
containing  a  1  corresponds  to  where  transient  state  x 
resides  in  the  strategic  state  vector. 

There is a specific reconfigurability for every $x \in
X_t^c$. Specific reconfigurabilities reflect an aspect of the
structural property of a strategic model. In particular,
the specific reconfigurability associated with state $x$
quantifies the likelihood of winning the air operation
upon entering the state. It also provides important
information for real-time tactical decisions and strategic
re-planning. The structure of the strategic model and
the transition coverage out of $x$ are the two determining
factors for the value of $\rho_x$.

Definition 2. Strategic reconfigurability is the arithmetic
average of all specific reconfigurabilities associated
with a given strategic model, i.e.,

$$\rho = \frac{1}{|X_t^c|} \sum_{x \in X_t^c} \rho_x. \qquad (4)$$

Strategic  reconfigurability  measures  the  ability  to 
respond  to  contingencies  of  the  overall  air  operation. 
Therefore,  it  can  be  used  as  a  criterion  to  compare 
different  strategic  plans. 

The two specific reconfigurabilities for state 1000 and
state 1010 in the strategic model described in Figure
1.1 are

$$\rho_8 = [\,0\ 0\ 0\ 0\ 1\ 0\,]\, P(\infty)\, [\,1\ 0\ 0\ 0\ 0\ 0\,]^T, \qquad \rho_{10} = [\,0\ 0\ 0\ 0\ 0\ 1\,]\, P(\infty)\, [\,1\ 0\ 0\ 0\ 0\ 0\,]^T.$$

The strategic reconfigurability of the same model is

$$\rho = \frac{1}{2}\rho_8 + \frac{1}{2}\rho_{10}.$$
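The definitions can be exercised numerically with the sketch below. It is an illustration under simplifying assumptions, not a reproduction of the values reported later: coverage is held constant (here at 1), so that the limiting matrix can be obtained in closed form for an absorbing chain instead of by integrating a time-varying Q(t). The helper names are hypothetical, and the rates are those quoted for the model of Figure 1.1.

```python
import numpy as np

ORDER = [1, 4, 5, 0, 8, 10]   # canonical order, winning state first
IDX = {state: i for i, state in enumerate(ORDER)}
N_ABS = 3                     # number of absorbing states in ORDER
LAM = {(0, 8): 0.2, (8, 0): 0.02, (8, 4): 0.04, (8, 5): 0.001,
       (8, 10): 0.4, (10, 4): 0.005, (10, 1): 0.05}


def rate_matrix(c_8_10, c_10_1):
    """Constant transition rate matrix Q in the canonical state order."""
    q = np.zeros((6, 6))

    def add(src, dst, rate):
        q[IDX[src], IDX[dst]] += rate
        q[IDX[src], IDX[src]] -= rate

    add(0, 8, LAM[0, 8])
    add(8, 0, LAM[8, 0])
    add(8, 5, LAM[8, 5])
    add(8, 10, LAM[8, 10] * c_8_10)
    add(8, 4, LAM[8, 4] + LAM[8, 10] * (1.0 - c_8_10))
    add(10, 4, LAM[10, 4])
    add(10, 1, LAM[10, 1] * c_10_1)
    add(10, 5, LAM[10, 1] * (1.0 - c_10_1))
    return q


def limiting_matrix(q):
    """P(infinity) for an absorbing chain with constant Q (absorbing states first)."""
    p_inf = np.zeros_like(q)
    p_inf[:N_ABS, :N_ABS] = np.eye(N_ABS)
    p_inf[N_ABS:, :N_ABS] = np.linalg.solve(-q[N_ABS:, N_ABS:], q[N_ABS:, :N_ABS])
    return p_inf


def specific_reconfigurability(p_inf, state):
    """Row selector for `state`, times P(infinity), times the winning-state selector."""
    row = np.zeros(len(ORDER))
    row[IDX[state]] = 1.0
    col = np.zeros(len(ORDER))
    col[0] = 1.0
    return float(row @ p_inf @ col)


p_inf = limiting_matrix(rate_matrix(1.0, 1.0))
rho_8 = specific_reconfigurability(p_inf, 8)
rho_10 = specific_reconfigurability(p_inf, 10)
rho = 0.5 * rho_8 + 0.5 * rho_10
print(round(rho_8, 4), round(rho_10, 4), round(rho, 4))
```

Because the coverage here is constant and equal to 1, these numbers are upper-end values for this model and should not be expected to match results obtained with the time-varying nominal coverage profiles.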

We now examine the effects of the two factors, i.e.,
strategic model structure and transition coverage, on
the strategic reconfigurability.

Effect of strategic model structure. A modification of
the strategic plan associated with Figure 1.1 is made. In
the new plan Blue is always alert and ready. To keep the
simplicity of the original model and make a meaningful
comparison between the two strategic operations, only
one of the four binary states in the original model, Red targeted, has
been changed, to Blue ready or offensive. This requires
careful strategic planning that fully exposes the strength
of Blue. The changed strategic plan is shown in Figure
1.4. The reader can follow the description of Figure 1.1
to understand what the remaining changes should be.


Figure  1.4  An  alternative  strategic  model 

Order the states from absorbing to transient as
(0001, 0100, 0101, 0010, 1010), or (1, 4, 5, 2, 10) in decimal,
for the model in Figure 1.4 to obtain a canonical
representation of the model. Taking $\lambda_{2,10} = \lambda_{0,8} = 0.2$,
$\lambda_{10,2} = \lambda_{8,0} = 0.02$, and the rest of the parameters equal to
the values given in the previous subsection, we obtain
$\rho = 0.56$ for the original strategic plan, and $\rho = 0.71$ for
the alternative strategic plan. The winning probabilities
for Blue of the two plans are 0.57 (Blue defensive) and
0.76 (Blue offensive), respectively.

Effect of transition coverage. The previous subsection
described transition coverage as an
attempt to separate the handling of an event from the
occurrence of the event. It is a means of capturing the
dynamic process of tactical execution, including both
estimation and control, and it can be manipulated
to facilitate certain outcomes in an air operation. Since
a specific reconfigurability is essentially the probability
of winning starting at a specific controllable state, it is
more informative of the effect of the transition coverage
in the planning and the execution of an air operation.

For the strategic model in Figure 1.1, specific reconfigurabilities
are calculated for the 2 × 3 parameters in
transition coverage $c_{8,10}(t)$ and $c_{10,1}(t)$. The results are
shown in Figure 1.5. The same state transition rates
and nominal coverage parameters as those in Table
S2 are used. Only the single parameter that represents
the horizontal axis in each plot is varied, and the
remaining 5 parameters are kept at the nominal values.
The display of specific reconfigurability is accompanied
by the corresponding winning probability of Blue, $P_1$,
at the 100th hour, because altering the strategic model
structure becomes necessary if the winning probability
does not conform with an enhanced or reduced strategic
reconfigurability. The monotonic dependence of
Blue's winning probability on the specific, and therefore
strategic, reconfigurabilities is clearly shown in Figure
1.5. The sensitivities of the dependence may depend on
the parameter ranges selected.

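[Figure 1.5 panels: $\rho_8(100)$ and $P_1(100)$ versus $A_{8,10}$ (left); $\rho_{10}(100)$ and $P_1(100)$ versus $A_{10,1}$ (right); horizontal axis ticks at 0.5, 0.7, 0.9]

Figure 1.5 Specific reconfigurabilities $\rho_8$ (left) and $\rho_{10}$
(right) vs. transition coverage parameters (final value $A_{8,10}$, increment $\Delta_{8,10}$,
time constant $\tau_{8,10}$) and (final value $A_{10,1}$, increment $\Delta_{10,1}$, time constant $\tau_{10,1}$), respectively.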

1.4.  Section  summary 

The section provided a way to exploit the potential
and to understand the limitations of an air operation
by scrutinizing its structure and the dynamics within
it through the notions of strategic reconfigurability and
transition coverage. These notions reflect Blue's ability
to estimate the strategic state and to control the
air operation. Although the aspects of modeling and
analysis were emphasized, the results have important
implications for initial resource allocation, and for
dynamic strategic planning and tactical execution in
air operations.

Strategic reconfigurability has been defined as the
arithmetic average of specific reconfigurabilities, which
are essentially winning probabilities of Blue initiated
at controllable states. The major distinction of this
quantity from a general winning probability is that,
besides being limited to the controllable strategic states,
there is no a priori distribution attached to the strategic
states. Therefore, contingencies that are controllable
but less likely to occur will not be penalized. In fact,
from this viewpoint, one may modify Definition 2 by a
set of criticality weighting factors $\{\alpha_x\}$, i.e.,

$$\rho = \frac{1}{|X_t^c|} \sum_{x \in X_t^c} \alpha_x \rho_x, \qquad 0 < \alpha_x \le 1, \qquad \sum_{x \in X_t^c} \alpha_x = 1.$$

As a final remark, we point out a numerical advantage
of using the two-level air operation model. If a large
disparity exists among transition rates, numerical
stiffness can make the analysis of an air operation difficult
or impossible. This problem can be mitigated by
absorbing the next state with a fast transition into the
originating state (assuming an infinitely fast transition),
and capturing the dynamics of the fast transition in a
tactical model.
2. Operational reconfigurability
2.1.  Problem  description 

Our objective is to meet the demand of the ever-decreasing
cycle length in military air operations, which is
the sum of the times required for planning, tasking,
execution, and evaluation. This objective, when projected
onto the expectation for a command and control (C2
hereafter) center, implies a swifter operation that
involves information gathering, information processing,
decision-making, and command issuing. The swiftness
in turn requires a high level of C2 system availability.
Availability of a system can generally be thought of as
the ratio of the system uptime to the sum
of the uptime and downtime. Our ultimate goal is to
achieve nearly uninterrupted C2 operations.

It is apparent that a C2 center plays the same role in an
air operation as the controller does in a feedback system. It
carries out the functional mapping from information to
decision in the feedback loop. A study conducted at the
Draper Labs [14] that focuses on the effect of the frequency
of loop closure in air operations concludes that the ability
to close the loop at a higher rate (4-hour cycle vs. 24-hour
cycle), among other benefits, significantly shortens
the time to achieve air campaign objectives. Other
endeavors to enhance the C2 functional capability using
different criteria and formalizations have also been
reported [10], [11], [17], [41].

The underlying assumption so far has been that the
structure which supports the functional mapping in a C2
center is always intact. In reality, however, a typical C2
center has grown into a large, complex, and imperfect
system. Many subsystem failures can occur for many
different reasons. For example, a miscarriage in
information flow can be attributed to a broken link, a
faded or jammed channel, a power outage, a failed sensor,
an impaired storage device, a crashed processor, a human
operator error, etc. In general, failures that disable the
C2 functional mapping can be related to subsystems
designated to perform data storage, transmission,
processing, or interpretation. They impact information
availability, information integrity, and decision making
in C2 centers. The effort to address these issues is still
at the very early stage of installing monitoring tools.
The more fundamental issues of redundancy architecture and
of an appropriate level of automation for failure
accommodation have received very little consideration. The
latter is important for mitigating unnecessary human
errors and delays.

In our view, the concern over the loss of C2 system
availability could be effectively addressed by a conscious
effort to modify the existing architecture so as to
eliminate all single point failures. The term C2 system
has been and will be used in the following development to
represent the network of subsystems and components that
host and support the functional mapping performed by the
controller in a C2 center. The most efficient way to
achieve the modification is to make use of, and
effectively manage, the redundancy that likely already
exists in C2 systems. The rest of the section explores
this possibility and quantifies its benefit to the overall
air operation.

The section is organized as follows. In Subsection 2.2, C2
system modeling is discussed for the purpose of
availability analysis, and the notion of operational
reconfigurability is introduced to describe the effective
level of redundancy. Subsection 2.3 discusses the
assessment of C2 system availability under variable
conditions such as subsystem failure rate, effectiveness
of redundancy management, maintenance policy, and
restoration rate. Subsection 2.4 discusses the effect of
C2 system operational reconfigurability on the outcome of
an air operation. Subsection 2.5 draws conclusions.

2.2.  Operational  reconfigurability 

The first part of this subsection discusses qualitative
availability modeling for a C2 system. Based on the model,
the need for a measure of the effectiveness of redundancy
management is argued, and the notion of operational
reconfigurability is introduced.

Availability [15] is the probability that a system is
performing its required function at a given point in time
when used under stated operating conditions. Among the
many definitions of availability, steady state
availability is considered here, which represents the
situation in which the failure-restoration cycle has
entered a steady state. This steady state definition is
assumed throughout the section without further
explanation. The availability value of a system is
determined by the following factors:

(i) reliability¹ distributions of individual subsystems
and the functionalities of the subsystems in relation
to the overall system;

(ii) the policy and capability by which the system is
maintained², such as the decision on the restoration
of failed subsystems and the distribution of the time
required to do so;

(iii) the methods and the likelihood of success in
management of existing redundancy, which are heavily
influenced by our ability to monitor and diagnose
subsystem failures, and to reconfigure the system upon
identifying the failures.

¹ Reliability is the probability that a (sub)system will
perform a required function for a given period of time
when used under stated operating conditions.

² Maintainability is the probability that a failed system
will be restored to a specified condition within a period
of time when maintenance is performed in accordance with
prescribed procedures.

Figure 2.1 shows a hybrid model [46] of an air operation.
The discrete state strategic model at the top is further
explained in Subsection 2.4. The tactical model is
represented in Figure 2.1 by a continuous state
closed-loop control system which governs the execution of
an air operation. The forward loop contains a model of
battle dynamics. The components of the tactical state
vector may include, for example, the strengths of Blue
assets, the rates of change of the strengths, their
geographic locations, the rates of change of the
locations, etc. The functional mapping carried out by the
controller in a C2 center is represented by the two blocks
in the feedback loop. It is responsible for generating two
sets of signals. One is an estimate of the strategic state
$x$, and the other is a corresponding control $u$ to drive
the tactical state $\xi$ to wherever desirable.
Availability of various subsystems in a C2 system is
required in order to ensure that $x$, and the parameters
associated with the controller and the estimator, are
available and correct, that $\eta$ and $r$ are available
and current, and finally that $\xi$ and $u$ are current
and correct.

Figure 2.1 A two-level model of an air operation (strategic model at the top, tactical model below)

An example of a functional decomposition of a C2 system is
given in Figure 2.2, where the blocks marked TS (tactical
and strategic sensors), DL (I/O control modules and data
links), SM (storage media), CP (critical processors), and
CS (critical software) represent some of the functional
units. A functional unit is defined as a subsystem of a
particular functionality that must be available in order
for the C2 system to be available. Each functional unit
can be a complex interconnection of many subsystems.
Considerable effort is usually necessary to arrive at a
functional decomposition. Let $A_{C2}$ denote the
availability of the C2 system, and $A_i$ the availability
of the $i$th functional unit. Then the availability of a
C2 system with $N$ functional units is given by

$$A_{C2} = A_1 \times A_2 \times \cdots \times A_N. \tag{5}$$
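
As a rough illustration of equation (5), the following Python sketch computes unit availabilities from the uptime/downtime ratio described in Subsection 2.1 and multiplies them into a C2 system availability. The unit names follow Figure 2.2, but the uptime and downtime hours and the function names are hypothetical and serve only to show the arithmetic.

```python
def unit_availability(uptime_hours, downtime_hours):
    """Availability as uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime_hours / total


def series_availability(unit_availabilities):
    """Equation (5): product of the availabilities of the N functional units."""
    a_c2 = 1.0
    for a_i in unit_availabilities:
        a_c2 *= a_i
    return a_c2


# Hypothetical (uptime, downtime) hours for the units named in Figure 2.2.
units = {
    "TS": (9990.0, 10.0),
    "DL": (9995.0, 5.0),
    "SM": (9976.0, 24.0),
    "CP": (9996.0, 4.0),
    "CS": (9999.0, 1.0),
}
a_units = {name: unit_availability(up, down) for name, (up, down) in units.items()}
a_c2 = series_availability(a_units.values())

for name, a in a_units.items():
    print(f"A_{name} = {a:.4f}")
print(f"A_C2 = {a_c2:.4f}")
```

Because every factor is at most one, the system availability can never exceed that of its least available functional unit, which is why a single weak unit dominates the product.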


Figure  2.2  Some  functional  units  in  a  C2  system 


In current C2 systems, vast opportunities exist for
availability improvement without hardware addition, and
without overburdening the subsystems in terms of
processing speed, memory space, bandwidth, etc. The
opportunities can be seized by, for example, assigning
multiple tasks to multiple subsystems rather than
assigning a single task to a dedicated subsystem within a
functional unit, or using multiple copies of a smaller
data set for recursive processing rather than a single
copy of a larger data set for batch processing. The
expanded portions of Figure 2.2 show two examples of
proposed architectural change for availability
improvement. The original SM unit contains $M$ storage
media holding independent databases. These subsystems are
named primary $DB_p^i$ for $i = 1, \ldots, M$. Each
primary subsystem is now appended with a redundant
"cache", called a secondary $DB_s^i$, using leftover
storage space elsewhere to hold the most critical and
immediately needed data. The original CP unit contains
$2K$ processor cell-Ethernet switch pairs with
non-overlapping tasks. Each pair is now equipped with the
necessary (software) tools of one other pair, and a
$K$-series redundant CP unit is formed.

This work, however, is not intended to explore innovative
ways to raise the level of redundancy, but to argue the
significance and to assess the benefit of having an
adequate redundancy level. A portion of the C2 system
above, shown in Figure 2.3, serves as a vehicle for this
purpose.


Figure 2.3 A glimpse of architectural change in C2 (baseline architecture on the left, alternative redundant architecture on the right)

Most functional units within C2 are themselves complex
interconnections of components. Statistical modeling [7]
of the failure process of individual subsystems must
follow carefully designed experiments of data collection,
parameter (or distribution) estimation, and
goodness-of-fit tests. Based on our initial investigation,
the failure rates (number of failures per unit time) of
many individual subsystems are below $10^{-5}$/hour when
intermittent failures are excluded. In this sense, the
subsystems are reliable. There is no doubt that
intermittent failures will reduce the C2 availability;
modeling of intermittent failures is our ongoing effort.
In addition, both diagnosis and restoration for permanent
failures of some subsystems can be lengthy processes
(hours to tens of hours). The most fundamental reason for
the need for redundancy is the fragility of an
architecture that allows single point failures. Individual
subsystems do fail and can fail at an unfavorable time.
The consequence to an air operation can be detrimental, as
will be seen in Subsection 2.4. It should be kept in mind
that statistical model development results from our lack
of knowledge of the physical processes leading to a
failure. As a consequence, we can only infer from our
sample of failure data to the general population, and our
predictions tell little about an individual system or
failure occurrence. Therefore, a non-redundant
architecture with highly reliable subsystems is robust but
fragile [9].

In order to measure the non-fragility, the notion of
operational reconfigurability (OR hereafter) is
introduced. In this section, OR specifically characterizes
the C2 system survivability with respect to single point
failures. Consider a canonical redundancy architecture
with a parallel-to-series interconnection where each
parallel interconnection in the outermost layer is
considered a functional unit. The right side of Figure 2.3
shows a two-layer canonical interconnection. It is
degenerate in the sense that there are no parallel
interconnections in the inner layer. Suppose there are $N$
functional units in a single-layer canonical
decomposition, and each has $M_n$ ($n = 1, \ldots, N$)
subsystems. Let $c_{m,n}$ denote the coverage of the $m$th
subsystem in the $n$th functional unit, where $c_{m,n}$ is
defined as Prob{$n$th unit operates | its $m$th subsystem
has failed}. Evaluating the $c_{m,n}$'s is not a trivial
task [43] because of their association with monitoring,
diagnosis, and redundancy management policy. The value of
$c_{m,n}$ also depends on how many operating subsystems
remain in the functional unit. In general, the larger the
number $M_n$, the larger the value of $c_{m,n}$ due to
reduced risk in redundancy management.

Operational  reconfigurability  OR  for  a  single-layer 
parallel-to-series  interconnected  system  is  given  by 


$$\mathrm{OR} = \min_{n \in \{1, \ldots, N\}} \frac{1}{M_n} \sum_{m=1}^{M_n} c_{m,n}. \tag{6}$$


In a multi-layered parallel-to-series interconnection
scheme, the expression would contain layers of
minimum-average operations. OR points to the weakest
functional unit in terms of its ability to manage the
redundancy for covering its first failure. In particular,
since $c_{m,n} = 0$ whenever $M_n = 1$, OR = 0 for any
system that contains a non-redundant functional unit. The
measure is not influenced by a priori subsystem failure
distributions, which is precisely what is needed to
reflect the non-fragility.
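
The minimum-average operation in (6) can be sketched in a few lines of Python. The nested-list layout, the function name, and the coverage values below are illustrative assumptions; only the formula itself comes from the text.

```python
def operational_reconfigurability(coverage):
    """Equation (6): minimum over functional units of the mean coverage.

    coverage[n][m] holds c_{m,n}, the probability that unit n keeps
    operating given that its subsystem m has failed.
    """
    if not coverage:
        raise ValueError("at least one functional unit is required")
    unit_means = []
    for unit in coverage:
        if not unit:
            raise ValueError("each functional unit needs at least one subsystem")
        unit_means.append(sum(unit) / len(unit))
    return min(unit_means)


# Hypothetical baseline: every unit has a single subsystem (M_n = 1), so c = 0.
baseline = [[0.0], [0.0]]
# Hypothetical redundant architecture with imperfect coverage values.
redundant = [[0.99, 0.99], [0.98, 0.97, 0.99]]

print(operational_reconfigurability(baseline))
print(operational_reconfigurability(redundant))
```

A single non-redundant unit drives the result to zero regardless of how well the other units are covered, which mirrors the weakest-unit interpretation given above.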

In the representative C2 system of Figure 2.3, let
$\mathrm{OR}_b$ denote the OR for the baseline
architecture, and $\mathrm{OR}_r$ the OR for the
alternative redundant architecture. Then
$\mathrm{OR}_b = 0$, and $\mathrm{OR}_r = 1$ if
$c_{m,n} = 1$ $\forall\, m, n$. In general,
$0 \le \mathrm{OR} \le 1$. OR essentially measures the
available redundancy in a C2 system and how well it is
managed. Because of its dependence on coverage, it
reflects monitoring and supervisory control performance.
Such performance indicates the ability of the C2 system to
restore system function via reconfiguration upon subsystem
failures. Reconfiguration can mean the removal of a failed
subsystem, the switch-on of a spare subsystem,
rescheduling of jobs, rerouting of the information flow,
redistribution of the information storage, etc.


2.3.  OR  and  availability 


This subsection discusses availability modeling in a more
quantitative manner. Its relation to OR and to other
parameters, such as restoration rate and maintenance
policy, is of particular interest. Two simplifying
assumptions are made here: (i) all subsystems in Figure
2.3 have exponential failure time distributions; (ii) all
restoration time distributions are also exponential.
Throughout the subsection the representative C2 system of
Figure 2.3 is used.

Viewed  as  a  canonical  form,  the  baseline  architecture 
in  Figure  2.3  contains  only  one  type  of  functional 
unit.  On  the  other  hand,  the  alternative  redundant 
architecture  carries  two  types  of  functional  units. 
The  composite  availability  expression  in  (5)  allows 
us  to  solve  for  the  availability  of  individual  units 
independently.  A  complete  solution  for  availability 
requires  the  specification  of  the  maintenance  policy. 
The  following  policies  are  considered  in  our  study: 

(i)  restoration  to  as  good  as  new  in  one  step  with  a 
prescribed  restoration  rate  independent  of  the  failure 
state  when  a  (functional)  unit  level  failure  occurs; 

(ii)  restoration  to  as  good  as  new  in  one  step  with  the 
lowest  restoration  rate  for  the  failure  state  when  a  unit 
level  failure  occurs; 

(iii)  restoration  to  as  good  as  new  in  one  step  with 
a  restoration  rate  determined  by  the  sum  of  average 
restoration  times  of  all  failed  subsystems  associated 
with  the  failure  state  when  a  unit  level  failure  occurs; 

(iv) restoration to as good as new in multiple steps
with a restoration rate determined by the criterion
of quickest unit recovery or of the most important
subsystem recovery (e.g., primary vs. secondary) when
a unit level failure occurs;

(v)  restoration  to  as  good  as  new  in  one  step  with  a 
prescribed  restoration  rate  independent  of  the  failure 
state  when  any  subsystem  failure  occurs; 

(vi)  restoration  to  as  good  as  new  in  one  step  with  the 
lowest  restoration  rate  for  the  failure  state  when  any 
subsystem  failure  occurs; 

(vii)  restoration  to  as  good  as  new  in  one  step  with 
a  restoration  rate  determined  by  the  sum  of  average 
restoration  times  of  all  failed  subsystems  associated 
with  the  failure  state  when  any  subsystem  failure 
occurs; 

(viii) restoration to as good as new in multiple steps
with a restoration rate determined by the criterion
of quickest unit recovery or of the most important
subsystem recovery when any subsystem failure occurs.

Figure 2.4 Rate transition diagrams of three types of functional units

The rate transition diagrams corresponding to the
maintenance policy stated in (i) are shown in Figure 2.4
for the three types of functional units in Figure 2.3.
Additional states must be introduced under most of the
other maintenance policies. The notation used in Figure
2.4 is as follows. $\lambda$ denotes a failure rate, and
$\rho$ denotes a restoration rate. A subsystem name
appearing in a subsection of a state name indicates that
the subsystem is up. A subsystem name with a superscript
$c$ appearing in a subsection of a state name indicates
that the subsystem is down. $c$ with an appropriate
subscript denotes coverage, and $\bar{c} = 1 - c$.
Superscript $l$ means a low value indicating, for example,
that a subsystem is in standby mode, and superscript $h$
means a high value (when the subsystem is no longer in
standby). The state marked "Critical Processor Unit
Failure" aggregates all CP unit failure states.

In general, for each of the three cases above, a set of
state transition probabilities can be solved from the
forward Kolmogorov equation $\dot{P}(t) = P(t)Q$,
$P(0) = I$, which can be established directly by balancing
the probability flow [8] at each state of a rate diagram.
The transition rate matrix $Q$ therefore completely
determines the set of transition probabilities. From the
transition probabilities, any state probability can be
easily calculated by setting appropriate initial state
conditions. When the number of states becomes large,
numerical techniques and approximations must be sought to
solve for the state probabilities of interest directly
from $\dot{p}(t) = p(t)Q$, $p(0) = p_0$, where
$p(t) = [\, p_1(t) \; \cdots \; p_n(t) \,]$ is the state
(row) vector for an $n$-state functional unit.
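
To make the state equation concrete, the sketch below integrates $\dot{p}(t) = p(t)Q$ with a classical fourth-order Runge-Kutta scheme for a two-state (up/down) unit and compares the result with the known closed-form solution for that chain. The two-state chain, the rate values, and the choice of integrator are assumptions made for illustration; they are not the diagrams of Figure 2.4.

```python
import math


def vec_mat(p, q):
    """Row vector times matrix: (p Q)_j = sum_i p_i Q_ij."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def integrate(q, p0, t_end, steps):
    """Integrate dp/dt = p Q from p(0) = p0 to t_end with classical RK4."""
    h = t_end / steps
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        k1 = vec_mat(p, q)
        k2 = vec_mat([p[i] + 0.5 * h * k1[i] for i in range(n)], q)
        k3 = vec_mat([p[i] + 0.5 * h * k2[i] for i in range(n)], q)
        k4 = vec_mat([p[i] + h * k3[i] for i in range(n)], q)
        p = [p[i] + h * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) / 6.0
             for i in range(n)]
    return p


# Assumed rates (per hour): failure rate lam, restoration rate rho.
lam, rho = 1.0e-4, 1.0 / 24.0
# State 0 = up, state 1 = down; each row of Q sums to zero.
Q = [[-lam, lam],
     [rho, -rho]]

t_end = 100.0
p = integrate(Q, [1.0, 0.0], t_end, 1000)

# Closed form for the two-state chain started in the up state.
p_down_exact = lam / (lam + rho) * (1.0 - math.exp(-(lam + rho) * t_end))
print(p[1], p_down_exact)
```

For chains with widely separated rates a fixed-step explicit scheme such as this one needs very small steps, which is the stiffness issue that motivates approximations for larger models.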

Since our interest is in the steady state availability,
the problem is much simplified. The steady state
unavailability can be obtained by solving the algebraic
equation

$$p_s Q_s = [\, 0 \; \cdots \; 0 \; 1 \,] \tag{7}$$

for the steady state probability vector $p_s$, where $Q_s$
is obtained by replacing the state equation involving the
derivative of the unit failure state probability by
$\sum_{i=1}^{n} p_i = 1$. The states of the redundant
critical processor unit are arranged in the order marked
in Figure 2.3, with $\lambda_c = \lambda_{pc}$,
$\lambda_e = \lambda_{es}$, $c = c_{es} = c_{pc}$,
$P = \lambda_{pc} + \lambda_{es}$, and $\rho^l$ ($\rho^h$)
denoting a restoration rate.
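
The construction of $Q_s$ in (7) can be sketched as follows: the column of $Q$ belonging to the unit failure state (placed last here) is replaced by ones, and the resulting linear system is solved by Gaussian elimination. The three-state duplex chain used as an example (both subsystems up, one up after a covered failure, unit failed, with restoration only after a unit level failure as in policy (i)) and its rate values are assumptions for illustration, not the chains of Figure 2.4.

```python
def steady_state(q):
    """Solve p Q_s = [0, ..., 0, 1], where Q_s is Q with its last column
    (the unit failure state equation) replaced by ones."""
    n = len(q)
    qs = [[1.0 if j == n - 1 else q[i][j] for j in range(n)] for i in range(n)]
    # p Q_s = e_n is equivalent to Q_s^T p^T = e_n^T; build the augmented matrix.
    a = [[qs[i][j] for i in range(n)] + [1.0 if j == n - 1 else 0.0]
         for j in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            raise ValueError("singular system")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                for k in range(col, n + 1):
                    a[r][k] -= f * a[col][k]
    return [a[i][n] / a[i][i] for i in range(n)]


# Assumed parameters: failure rate, restoration rate (per hour), coverage.
lam, rho, c = 1.0e-4, 1.0 / 24.0, 0.99
# State 0: both up; state 1: one up; state 2: unit failed (listed last).
Q = [[-2 * lam, 2 * lam * c, 2 * lam * (1 - c)],
     [0.0, -lam, lam],
     [rho, 0.0, -rho]]

p_s = steady_state(Q)
unavailability = p_s[-1]
# Closed form from the balance equations of this particular chain.
closed_form = (2 * lam / rho) / (1 + 2 * c + 2 * lam / rho)
print(unavailability, closed_form)
```

The last entry of the solution is the steady state unavailability of the unit, and one minus that entry is its steady state availability.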

The unavailability of the functional unit is given by the
state probability corresponding to the unit failure state.
Figure 2.5 shows the results of evaluating the
unavailability reduction factor,

$$\mathrm{URF} = \frac{1 - A_{C2}(\mathrm{OR}_b)}{1 - A_{C2}(\mathrm{OR}_r)} = \frac{1 - A_b}{1 - A_r}, \tag{8}$$

the ratio of the unavailability of the baseline to that of
the redundant architecture under maintenance policy (iii)
for the SM unit (top) and for the CP unit (bottom),
respectively. A reduction in unavailability by a factor of
about 20 to 95 is observed in both units for the range of
failure rates indicated in Table 01, and for
$\rho^l = 1/24$ hr$^{-1}$. Numerical comparisons are also
made with respect to the different maintenance policies
listed above using two different restoration rates³. The
results are not shown due to space limits.


Figure 2.5 Unavailability reduction factor (URF), $(1 - A_b)/(1 - A_r)$, for SM and CP units due to redundancy architecture change


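Parameter                                   Value
$\lambda_{pc} = \lambda_c$ (hr$^{-1}$)      $9.0 \times 10^{-6} \sim 10^{-4}$
$\lambda_{es} = \lambda_e$ (hr$^{-1}$)      $7.4 \times 10^{-6} \sim 10^{-4}$
$\lambda_p$ (hr$^{-1}$)                     $5.0 \times 10^{-6} \sim 10^{-4}$
$\lambda_s^l$ (hr$^{-1}$)                   $5.0 \times 10^{-7} \sim 10^{-6}$
$\lambda_s^h$ (hr$^{-1}$)                   $1.5 \times 10^{-5} \sim 10^{-3}$
$\rho^l$ (hr$^{-1}$)                        1/24
$\rho^h$ (hr$^{-1}$)                        1/4
$\mathrm{OR}_b$                             0
$\mathrm{OR}_r$                             0.99

Table 01. SM and CP units failure, restoration rates,
and OR numbers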

3 The  authors  would  like  to  thank  Ms.  Xiaoxia  Wang  for  her 
help  in  carrying  out  some  of  the  availability  calculations 

2.4. OR and air operations

It  is  time  to  turn  to  the  overall  air  operation, 
and  investigate  the  benefit  of  a  higher  OR  to  the 
winning  probability  of  Blue.  This  is  an  understandably 
difficult  problem  because  of  the  conceivable  complexity 
in  establishing  the  linkage  between  the  availability 
of  the  controller  in  a  C2  center  and  the  success  of 
an  air  operation,  though  we  have  just  established  a 
definitive  relationship  between  the  availability  and  the 
OR.  Fortunately,  an  earlier  development  [46]  in  our 
effort  has  provided  the  right  framework  to  encourage  an 
attempt.  A  brief  review  of  the  framework  is  in  order. 

A simple representation of a strategic model is shown at
the top of Figure 2.1. It refers to the mathematical
description of the evolution of a strategic plan in an air
operation, and takes the form of a discrete state,
continuous time Markov process. The model is specified by
(i) a state space $X$, (ii) a set of initial state
probabilities $\{p_x(0), x \in X\}$, and (iii) a set of
state transition rates
$\{\lambda_{x,x'}(t), x, x' \in X\}$ from the current
state $x$ to the next state $x'$ [8]. Figure 2.6 shows a
low resolution example of a strategic model whose state is
composed of 4 binary variables: (Blue threatened, Blue
defeated, Red targeted, Red defeated). A state of (True,
False, True, False) can be represented by $x = 1010$ in
binary. The meanings of the remaining states can be
explained similarly. States 0000, 1000, and 1010 in Figure
2.6 are transient states, and states 0100, 0101, and 0001
are absorbing states. Depending on whether preserving the
Blue assets, in addition to destroying the Red assets, is
also part of the mission of an air operation, the set of
desirable outcomes for Blue is either {0001} or
{0101, 0001}.
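
The binary state labels can be handled programmatically as in the following Python sketch, which maps the four Boolean variables to a label and to the decimal index used for that state. The function and constant names are hypothetical; the flag order and the transient/absorbing classification are taken from the text.

```python
# Order of the four binary variables, as listed in the text.
FLAGS = ("blue_threatened", "blue_defeated", "red_targeted", "red_defeated")

# Classification of the states of Figure 2.6, as stated in the text.
TRANSIENT = {"0000", "1000", "1010"}
ABSORBING = {"0100", "0101", "0001"}


def encode(flags):
    """Map four Booleans to (binary label, decimal index)."""
    if len(flags) != len(FLAGS):
        raise ValueError("expected four Boolean values")
    bits = "".join("1" if f else "0" for f in flags)
    return bits, int(bits, 2)


def decode(index):
    """Map a decimal index back to a dictionary of named Booleans."""
    bits = format(index, "04b")
    return {name: bit == "1" for name, bit in zip(FLAGS, bits)}


label, index = encode((True, False, True, False))
print(label, index)                 # 1010 10
print(decode(8))                    # Blue threatened only
print(label in TRANSIENT, "0001" in ABSORBING)
```

With this encoding, a state such as 1000 has index 8 and 1010 has index 10, which is convenient when transition rates are written with decimal subscripts.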


Figure 2.6 A low resolution strategic model

An important set of parameters introduced in our earlier
work is the set of transition coverage values [46]. A
transition coverage associated with a transition from $x$
to $x'$ is the conditional probability that the intended
transition in fact occurs given that a triggering event
has arrived. It is denoted by $c^u_{x,x'}(t)$, where $u$
indicates its dependence on the control policy used in the
tactical operation produced by the controller in a C2
center. A transition coverage effectively modifies the
corresponding transition rate via
$\lambda_{x,x'} c^u_{x,x'}(t)$. The transitions that have
transition coverage attached to them are called
controllable transitions. Blue's control objective is to
maximize the transition coverage under the constraints of
its resources and battle dynamics. The presence of C2
availability naturally modifies the originally defined
transition coverage [46] for any controllable transition
from $x$ to $x'$ in the following manner.


$$\tilde{c}^u_{x,x'} = c^u_{x,x'} A_{C2}, \qquad \tilde{\bar{c}}^u_{x,x'} = c^u_{x,x'} (1 - A_{C2}) + (1 - c^u_{x,x'}), \tag{9}$$

where

$$\tilde{c}^u_{x,x'} + \tilde{\bar{c}}^u_{x,x'} = 1 \tag{10}$$

forms the Poisson decomposition [46] of the associated
transition rate $\lambda_{x,x'}$. The original Poisson
decomposition becomes a special case when C2 availability
is perfect. The introduction of an imperfect C2 system
availability evidently reduces the effectiveness of the
controller in the C2 center.
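
A small numerical check of (9) and (10) may help. The Python sketch below scales a transition coverage by the C2 availability, computes the complementary uncovered part, and applies the result to a transition rate. The function names and the numerical values are hypothetical.

```python
def modified_coverage(c, a_c2):
    """Equation (9): coverage scaled by C2 availability, and its complement."""
    if not (0.0 <= c <= 1.0 and 0.0 <= a_c2 <= 1.0):
        raise ValueError("coverage and availability must lie in [0, 1]")
    covered = c * a_c2
    uncovered = c * (1.0 - a_c2) + (1.0 - c)
    return covered, uncovered


def effective_rate(lam, c, a_c2):
    """Covered part of a controllable transition rate under imperfect C2."""
    covered, _ = modified_coverage(c, a_c2)
    return lam * covered


# Hypothetical values: coverage 0.9, C2 availability 0.99, rate 0.4 per hour.
covered, uncovered = modified_coverage(0.9, 0.99)
print(covered, uncovered, covered + uncovered)
print(effective_rate(0.4, 0.9, 0.99))

# Perfect availability recovers the original decomposition (c, 1 - c).
print(modified_coverage(0.9, 1.0))
```

The two parts always sum to one, as (10) requires, and an unavailable C2 system (availability zero) removes the covered part of the transition entirely.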

We now examine the average effect as well as the real-time
effect of an imperfectly available C2 system on the air
operation model of Figure 2.6. The following data are used
in producing the results in Figure 2.7 and Table 02:
$\lambda_{0,8} = 0.2$, $\lambda_{8,0} = 0.02$,
$\lambda_{8,4} = 0.04$, $\lambda_{8,5} = 0.001$,
$\lambda_{8,10} = 0.4$, $\lambda_{10,4} = 0.005$,
$\lambda_{10,1} = 0.05$,
$c_{8,10} = 0.95(1 - 0.5e^{-t/5})$, and
$c_{10,1} = 0.95(1 - 0.5e^{-t/10})$. Here the subscripts
are the decimal values of the binary state labels (for
example, 8 for 1000 and 10 for 1010). The modifications in
transition coverage are as follows:

$$\tilde{c}_{8,10}(t) = c_{8,10} A_{C2}, \qquad \tilde{\bar{c}}_{8,10} = c_{8,10}(1 - A_{C2}) + (1 - c_{8,10}),$$

$$\tilde{c}_{10,1}(t) = c_{10,1} A_{C2}, \qquad \tilde{\bar{c}}_{10,1} = c_{10,1}(1 - A_{C2}) + (1 - c_{10,1}),$$

where the various cases of $A_{C2}$ considered are listed
in Table 02. The function $1(t)$ in Table 02 denotes the
unit step function. The 4 hour and the 24 hour time slots
of unavailable C2 system correspond to the restoration
rates used in the calculation of the previous subsection.

The results are shown in Figure 2.7. Final winning
probabilities are listed in Table 02.

Figure 2.7 Winning probabilities up to the 100th hour of a military air operation (one panel per C2 availability case, including $A_{C2} = 0.99 \times 1(t)$, $A_{C2} = 1(t-4)$, and $A_{C2} = 1(t-24)$; horizontal axis: time in hours)

C2 availability                   Blue wins   Red wins   Both lose
$A_{C2} = 1(t)$                   0.56        0.35       0.09
$A_{C2} = 0.99 \times 1(t)$       0.54        0.36       0.10
$A_{C2} = 1(t - 4)$               0.35        0.58       0.07
$A_{C2} = 1(t - 24)$              0.01        0.99       0.00

Table 02 Winning probabilities at t = 100 hours
when transient state probabilities have died out.

It can be seen from Figure 2.7 and Table 02 that a slight
reduction in C2 availability has a limited effect on the
outcome of the air operation on average. However, when the
real-time unavailability of a C2 system falls within a
critical period, the outcome can be disastrous. The latter
case is shown in the two plots at the bottom of Figure
2.7, and in the two rows at the bottom of Table 02. These
show where the fragility lies. An enhanced operational
reconfigurability can reduce the unavailability, and hence
the fragility, by a factor of about 20 to 95 (one to two
orders of magnitude), as shown in the example of the
previous subsection. The reduction is achieved by filling
in the periods of operation interruptions with a fairly
unsophisticated usage of existing redundancy.

2.5.  Section  summary 

This section delineated the importance and the potential
of being able to provide and manipulate redundancy in the
command and control system of a military air operation.
The effort boils down to modifying the C2 system
architecture so as to raise the system operational
reconfigurability. An enhanced OR helps reduce the
fragility of an otherwise robust system. The cost of
reducing fragility is the extra complexity of the system,
which must include diagnosis and management of redundancy
(or supervisory control). The complexity introduced,
however, is a small increment over the more costly and
less carefully studied effort of setting up monitoring
tools within C2 centers. Some simple but quantified case
studies were presented to support our argument. Our
ongoing effort is focused on more detailed availability
modeling.

3. An example of supervisory control in C2
3.1. Problem description

In [45], a modern military air operation is modeled as a
hybrid feedback control system. The work quantifies the
impact on the overall air operation of the redundancy
architecture that supports the functionality of the
controller residing in a command and control (C2) center.
The resilience of the architecture, which reflects the
quality of redundancy management, is measured by
operational reconfigurability. The work concludes that
successful use of redundancy is crucial to reducing
fragility [9] in air operations, because it eliminates any
single point failure in a C2 unit for which a long
restoration period, relative to the mission time, is
required.

Departing  from  [45]  where  all  C2  units  are  modeled 
as  static  systems,  this  section  considers  a  closed  queuing 
network  model  [8]  for  the  C2  processing  unit  depicted  in 
Figure  3.1.  As  a  consequence,  less  drastically  different 
and  more  realistic  outcomes  are  expected  between 
uncontrolled  and  controlled  operations  when  the  unit 
experiences  failures. 


Figure  3.1  A  C2  processing  unit  as  a  closed  queuing 
network 

The C2 processing unit under consideration contains two
cells in series that complete two tasks in sequence for a
single class (class 1) of arriving customers on a
first-come, first-served (FCFS) basis. There are 3
customers in the network. The status of a customer is
elevated to class 2 as soon as it completes the service in
cell 1, and the elevated status is removed as soon as the
customer completes the service at cell 2. Each cell has a
server with an exponential service time (exp($\mu_i$)), an
exponential life time (exp($\nu_i$)), and an exponential
restoration time (to as good as new). The queue preceding
each server has sufficient storage capacity. A more
complex model has been built for a more elaborate
simulation study using Arena [23]. This complex model also
contains a database unit and a human operator unit. The
simulation results are reported separately in [48]. The
delays captured in the three servers with exponential
delay times (exp($\lambda$)) in the feedback branch are
intended to also reflect the response times of the nodes
ignored in the simplified model. The simplification allows
us to derive an analytic model for scrutinizing the effect
of a supervisory control scheme on the processing unit.

Many  analytical  and  numerical  results  on  queuing 
systems  with  server  failures  can  be  found  in  the  liter¬ 
ature  [1],  [18],  [40].  They  are  mostly  too  specific  with 
regard  to  structural  properties  and  operating  policies  to 
be  widely  applicable.  Our  goal  here  is  to  enhance  the 
availability  of  the  network  without  compromising  the 
network  performance  such  as  the  response  time.  To  that 
end,  a  supervisory  control  scheme  is  to  be  attempted. 
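Two control inputs are introduced: $u_1$, applicable to the
first customer in cell 1, and $u_2$, applicable to the first
customer in cell 2. Let $Q_i$, $S_i$, and $C_i$ denote queue $i$,
server $i$, and class $i$, respectively, where $i = 1, 2$.

$$u_1 = \begin{cases} 1, & \text{serve } C_1 \text{ at } S_1 \\ 0, & \text{serve } C_1 \text{ at } S_2 \text{ if } S_1 \text{ fails} \vee S_2 \text{ idles} \end{cases}$$

$$u_2 = \begin{cases} 1, & \text{serve } C_2 \text{ at } S_2 \\ 0, & \text{serve } C_2 \text{ at } S_1 \vee \text{preempt-resume } Q_1 \text{ if } S_2 \text{ fails} \end{cases}$$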

As a result, according to the definition given in [45],
the operational reconfigurability (OR) for the network
in Figure 3.1 is OR = 0 without the supervisory control
and OR = 1 with the supervisory control. The latter
assumes perfect monitoring that acquires the state
information in the network; otherwise OR < 1.

In Subsection 3.2, a Markov model for the processing
unit in Figure 3.1 incorporating the above supervisory
control is derived. In Subsection 3.3, two measures of
the unit’s performance are evaluated with and without
activating the supervisory control. One measure is the
unit’s steady-state response time to a customer (from
entering cell 1 to exiting cell 2), and the other is the
steady-state availability, which is roughly the fraction of
time the network is up while the servers are not idled.
Subsection 3.4 discusses the effect of supervisory control on
the overall air operation, and a possible extension of
this work.

3.2.  Markov  model  of  the  processing  unit 

With some abuse of notation, the states in
our Markov model are named $Q_1Q_2S_1S_2$, where $Q_i \in \{0, 1, 2, 3\}$
is the queue length at cell $i$ with $Q_1 + Q_2 \le 3$,
and $S_i \in \{0, 1\}$ represents server $i$ intact (0) or failed
(1). Figure 3.2 shows all the states in rectangular boxes
and all the transition rates by the arcs linking the states.
An alternative state name is marked by a circled number
from $\{1, \ldots, 40\}$ near the corresponding state. The state
transition rates are denoted by pairs of the form $a, b$,
where $a$ is the transition rate from a state named by a
smaller decimal number to a state named by a larger
decimal number and $b$ is the rate back to the smaller
number.


Figure  3.2  Rate  transition  diagram  of  the  Markov 
model  for  the  C2  processing  unit  in  Figure  3.1 

Denote the rate transition matrix by $Q$, a $40 \times 40$
matrix. Only the nonzero entries of $Q$ involving control
signals are listed. These signify the exertion of
supervisory control actions on state transitions in the
way defined in Subsection 3.1. In particular,
$q_{7,?} = \mu_2(1 - u_2)$ (second index illegible in the source),
$q_{11,8} = \mu_1(1 - u_1)$, $q_{15,7} = \mu_1 u_2$, $q_{19,15} = \mu_1 u_2$,
$q_{19,10} = \mu_2(1 - u_2)$, $q_{23,20} = \mu_1(1 - u_1)$, $q_{27,15} = \mu_1(1 - u_2)$,
$q_{30,27} = \mu_1 u_2$, $q_{30,19} = \mu_2(1 - u_2)$, $q_{33,30} = \mu_1 u_2$,
$q_{33,22} = \mu_2(1 - u_2)$, $q_{36,34} = \mu_1(1 - u_1)$. Note that
$\bar{u}_i = 1 - u_i$ in Figure 3.2.

In general, a set of state transition probabilities can
be solved from the forward Kolmogorov equation $\dot{P}(t) = P(t)Q$,
$P(0) = I$, which can be established directly by
balancing the probability flow [8] at each state of a rate
diagram. Since our interest is in the steady-state
probabilities, the problem is much simplified. Let $p_i$
denote the steady-state probability for state $i$. Then
$p = [p_1 \cdots p_{40}]$ can be solved from

$$p\,Q = 0$$

with one of its 40 scalar equations replaced by

$$\sum_{i=1}^{40} p_i = 1.$$
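The replacement of one balance equation by the normalization condition can be sketched in a few lines. The following is only an illustration on a hypothetical two-state chain (one server that fails at rate $\nu$ and is restored at rate $\gamma$), not the 40-state model of Figure 3.2; the helper names `solve_linear` and `steady_state` are assumptions introduced here.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def steady_state(Q):
    """Solve p Q = 0 with sum(p) = 1 for an irreducible rate matrix Q."""
    n = len(Q)
    # p Q = 0 is Q^T p^T = 0; row j of A is the balance equation of state j.
    A = [[Q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    A[-1] = [1.0] * n      # replace the last balance equation by sum(p) = 1
    b[-1] = 1.0
    return solve_linear(A, b)


# Toy chain (assumption): state 0 = server intact, state 1 = server failed.
nu, gamma = 0.005, 1.0 / 24.0
Q_toy = [[-nu, nu],
         [gamma, -gamma]]
p_toy = steady_state(Q_toy)
print(p_toy)
```

For the toy chain the result should agree with the closed form $p_0 = \gamma/(\nu + \gamma)$; the same routine applies unchanged to a larger rate matrix once its entries are filled in from a rate diagram.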

At $\lambda = 6$, $\mu = \mu_1 = \mu_2 = 12$, $\nu_1 = \nu_2 = 0.005$, $\gamma = 1/24$,
for example, the set of steady-state probabilities
shown in Figure 3.3 is obtained. All the quantities
above carry the unit of a rate, i.e., (time unit)$^{-1}$. Many
steady-state performance measures can be computed from
these probabilities. Two of them will be detailed in the
next subsection.

Figure 3.3 Steady-state probabilities without (upper)
and with (lower) supervisory control (horizontal axis: state; vertical axis: probability $p$)

3.3.  Response  time  and  availability  comparison 

The mean response time $E[T_R]$ can be solved for the
processing unit by applying Little’s Law to the top and
the bottom parts of the queuing network in Figure 3.1
[8]. It has the expression

$$E[T_R] = \frac{3}{\mu\,(1 - \mathrm{Prob}[S_2 \text{ not busy}])} - \frac{1}{\lambda}.$$

Let $T_b$ be the response time of the processing unit
without supervisory control, and $T_r$ be the response
time with supervisory control. Note that the subscripts
are inherited from [45], where $b$ stands for baseline
and $r$ stands for redundant. Define the response-time
reduction factor as

$$\mathrm{RRF} = \frac{T_b}{T_r}.$$

Figure 3.4 shows that RRF always improves with
the use of supervisory control. RRF increases with
increasing service rate, most prominently when delay
time $1/\lambda$ is small, and RRF increases with increasing
failure rate, most prominently when restoration rate is
high.


Figure 3.4 Response time reduction factor as functions
of $\mu$ (upper) and $\nu$ (lower) with $\lambda$ (upper) and $\gamma$
(lower) as parameters

The unavailability of the processing unit is computed
as the sum of the probabilities of the states that contribute
to the failure of the unit. For example, without the application
of supervisory control, the unit unavailability is given by

$$U_b = p_1 + p_{11} + p_{14} + p_{15} + p_{18} + \cdots + p_{39} + p_{40}.$$

With the application of supervisory control, the unit
unavailability is given by

$$U_r = p_{14} + p_{18} + p_{26} + p_{29} + p_{32} + p_{37} + p_{38} + p_{39} + p_{40}.$$

Define the unavailability reduction factor as

$$\mathrm{URF} = \frac{U_b}{U_r}.$$

Figure 3.5 shows that URF always improves with the
use of supervisory control. URF generally decreases
with increasing service rate, most prominently for some
particular value of delay time $1/\lambda$, in this case 1 unit.
URF always decreases with increasing failure rate $\nu$.


Figure 3.5 Unavailability reduction factor as functions
of $\mu$ (upper) and $\nu$ (lower) with $\lambda$ (upper) and $\gamma$ (lower)
as parameters

It is also observed that under the same structural
parameters, URF is several times larger when the C2
processing unit is regarded as a static unit than
when it is regarded as a dynamic (queuing system) unit.
This justifies our effort in queuing network modeling.

3.4. Section summary

The effect of the C2 availability on the winning
probability of an air operation involving two opposing
sides named Blue and Red (enemy) can be folded into
the values of transition coverage of a high level strategic
model [46] of the air operation. In particular,

$$C^u_{x,x'} = c^u_{x,x'} A_{C2}, \qquad \bar{C}^u_{x,x'} = c^u_{x,x'}(1 - A_{C2}) + (1 - c^u_{x,x'}),$$

where $A_{C2}$ is the C2 availability, $c^u_{x,x'}$ is the coverage for
transition from strategic state $x$ to a desirable strategic
state $x'$ for Blue, $u$ is the control action taken in the tactical
air operation, and $C^u_{x,x'} + \bar{C}^u_{x,x'} = 1$. Therefore, the same
conclusions as those drawn in [45] hold, which have been
stated at the beginning of Subsection 3.1.

The supervisory control applied to the processing
unit in this section is by no means the best possible control
policy because it is set based on common sense rather
than guided by a rigorous criterion. It is possible to
form criteria by introducing costs associated with
each control action or inaction. The policy derived
under the criteria would be optimal in the sense that it
minimizes some total cost index. This is called a Markov
decision problem [8], and its solution to enhancing C2
reconfigurability will be pursued in the near future.
4.  Fault-tolerance  of  multihop  wireless  networks  (R)
4.1.  Problem  description 

The class of wireless networks under consideration is
the class of multiple-hop, distributed networks consisting
of a large number of nodes. Each node has a limited
energy supply that cannot be replenished, and is capable
of packet transmission, reception, and processing that
involves detection, fusion, coding and decoding. Our
goal is to maximize the network reliability at its design
life $T_D$.⁴

Our main challenge is to develop a power covariate
network reliability model.⁵ As a result, the network
reliability becomes the overarching measure that encompasses
aspects of symbol error rate, energy efficiency,
bandwidth efficiency, the effect of clustering, and the
effect of feedback.

Many algorithms have been developed for the computation
of node-pair reliability of networks, which is
the probability that at least one route exists between
a source node and a terminal node (Torrieri, 1994).
Unlike other networks, however, each route in our
network itself forms a sub-network with an additional
structure bound by the cooperative transmission scheme
used. Therefore, we confine ourselves to the sub-network
of a $K$-cluster route through which packets hop from
cluster 1 to cluster $K$. The restriction to the single-route
problem is entirely due to our intention to capitalize
on some new physical layer transmission schemes (Li
and Wu, 2003; Li, 2003, 2004). Our interest is not in
devising routing protocols (Ordonez et al., 2004) that
enhance the network connectivity evaluated using the
knowledge of the spatial distribution of the wireless
nodes (Xue and Kumar, 2004), or prolong network
lifetime assessed using the deterministic knowledge of
energy expenditure at each node (Bhardwaj et al.,
2002). Instead, we are seeking to understand and to
optimize the temporal evolution of network reliability
and to utilize this information in the network operation
with little supervising activity.

Existing schemes for enhancing the network fault-tolerance
all carry significant overhead in terms of
energy consumption. Examples of such schemes include
multiple-path routing (Ganesan et al., 2002), packet
replication (De et al., 2003), or feedback between
neighboring nodes that either acknowledges a successful
reception or requests a re-transmission of a packet
(Kumar, 2001). Unlike more traditional networks, such
as the Internet, where highly reliable links contribute
little to the transmission failures, the links of wireless
networks are much less reliable as a result of, for
example, severe channel fading, limited standalone
reliability of low-cost nodes, or energy depletion of
nodes. On the other hand, redundancy is abundant in
such networks. Therefore, opportunities exist to address
the issues of fault-tolerance and energy efficiency
simultaneously. Of particular interest is the question of how
much feedback is needed at a certain level of redundancy
usage for a prescribed network design life.

⁴ A design life is defined as the maximum time by which a
prescribed network reliability $R_D$ is maintained, i.e., $F_{net}(t)|_{t=T_D} = 1 - R_D$,
where $F_{net}(t)$ is the cumulative distribution function of
the network time to failure.

⁵ The network reliability is given by $R_{net}(t) = 1 - F_{net}(t)$,
which is defined as the probability that the network performs
successfully its required function over a period of $t$ time units
under the stated operating conditions.

With a proper formulation of a cooperative transmission
problem employing multiple nodes, transmission
diversity can be provided to combat the deep fading suffered
by near-ground communications (Laneman and
Wornell, 2003; Sendonaris et al., 2003). The existing
cooperative diversity schemes, though efficient in
transmission power, increase the circuit energy consumption
associated with, for example, static current
in transceivers and encoding/decoding circuitry, when
multiple nodes must be kept on for listening and
reception (Ganesan et al., 2002). We are developing
new cooperative transmission schemes to address power
efficiency, bandwidth efficiency, and fault-tolerance
simultaneously. Our preliminary simulation results (Li
and Wu, 2003) indicated a 6-fold reduction in power
consumption at an enhanced level of network reliability
with a two-node cluster that achieves a 15 dB
signal-to-noise ratio at the receiving cluster. This can be
implemented using a new space-time block coding
technique (Li, 2003 and 2004) with no loss of bandwidth
efficiency.

Little has been discussed at the physical layer in
terms of network fault-tolerance (Hoblos et al., 2000)
up to this point. Our basic idea is to determine the
level of redundancy appropriate for our cooperative
transmission scheme that also maximizes the network
reliability at its design life. Since high cross-correlation
among packets exists under this scheme, a certain
packet loss rate could be tolerated without having to
incur the energy loss associated with frequent feedback and
re-transmission.

The section is organized as follows. In Subsection
4.2, a re-transmission chain is formed that serves to
motivate the quest for understanding the impact of
loop closure on the network reliability. Subsection 4.3
discusses modeling the life time distribution of a node,
and deriving the network level reliability and its lower
bound as a function of link reliabilities. Subsection
4.4 applies the link reliabilities to the assignment of
active nodes to clusters to maximize the network
fault-tolerance up to its design life. It also tackles the
re-transmission issue as a Markov decision problem with
partial information feedback.
4.2.  A  motivating  example

The Markov chain in Figure 4.1 describes a $K$-cluster
packet transmission process where state name $i$ stands
for the $i$th cluster within which a packet hopping from
the source through the network to the destination is
residing. This chain is non-homogeneous due to the
deteriorating link reliability $p_i^l$ as the network ages. The
link reliability, to be evaluated in the next subsection, is
the probability that a packet reaching the $i$th cluster is
successfully relayed to the $(i+1)$th cluster with a required
power level.

$c_i$, called a supervisory coverage, in Figure 4.1 is
the conditional probability that upon the failure of the
first transmission attempt, a re-transmission command
is successfully issued to cluster $i$. In an unsupervised
environment, $c_i = 1$ for the first transmission attempt,
and $c_i = 0$ for any re-transmissions. In a supervised
environment, on the other hand, $0 < c_i < 1$ in
general (Wu, 2004). The factors affecting $c_i$ include
lack of observability of the state or erroneous state
estimation, failure of a supervising node or cluster, a fading
channel linking the supervising cluster and cluster $i$,
and collision among packets, in which case a more
elaborate queuing network model becomes appropriate.
Therefore, in a truly distributed environment, it is
reasonable to assume that $c_i < p_i^l$. $u(i)$ in Figure 4.1
is the re-transmission control action when state $i$ is
entered. For the moment, $u(i) = 1$ and time-invariant
$p_i^l$ are assumed.


Figure 4.1 Packet transmission in a $K$-cluster route (forward transition probabilities $p_i^l c_i u(i)$, self-loop probabilities $1 - p_i^l c_i u(i)$, $i = 1, \ldots, K-1$)

With the Markov chain established, the state
probability $p_i^c$, i.e., the probability that a packet is in
cluster $i$, can be calculated by solving recursively for
$\pi_k = [p_1^c(k) \cdots p_K^c(k)]$ from $\pi_{k+1} = \pi_k \mathcal{P}_{k,k+1}$, where
$\mathcal{P}_{k,k+1}$ is the transition probability matrix (Cassandras and
Lafortune, 1999). [...] consumes power $P_i$. The average
number of transmissions needed to reach
state $i+1$ can be shown to be $N_i = 1/(p_i^l c_i)$, $i = 1, \ldots, K-1$.
Then the power usage per packet transmission
through the network, or power efficiency, can be
estimated by $P = \sum_{i=1}^{K-1} N_i P_i$, and $E = \sum_{t \ge 0} P$,
accumulated over the network life time, is
an estimate of the network energy efficiency.
Here the notion of the network age $t$ is specialized
to the number of packet transmissions that the network
has carried out so far, with the assumption that all clusters
age uniformly and the number of redundant nodes in
each cluster is large.

Let us consider two simple but representative cases. In
the first case there is no feedback and no supervisory
activity, i.e., $c_i = 0$ for all re-transmissions, while
cluster transmission with multiple nodes is used. In the
second case a supervisory scheme is in place to issue
re-transmission whenever needed, while only a single node
in a cluster is used at a time for each transmission
attempt.


Figure 4.2 2-node w/o feedback (left panel, $p_i^l \equiv 0.99$) vs. 1-node w/ feedback
(right panel, single-node cluster with re-transmission, $p_i^l \equiv 0.9$, $c_i \equiv 0.9$); horizontal axes: $k$


Figure 4.2 shows 10 snapshots of state probabilities
for a 5-cluster route when a packet transmission is
initiated at $k = 1$ for the above mentioned two
representative cases. The five rows of the plots are $p_1^c$ through $p_5^c$ at
10 consecutive instants of packet transmissions. The left
column of plots is for the unsupervised case, where the
link reliability $p_i^l = 0.99$ for all $i$ at the current network
age, resulting from a 2-node cooperative transmission
with a reliability of 0.9 for each node. For the moment
perfect channels are assumed, in which case a link
reliability is the same as a cluster reliability. The right
column of plots is for the supervised case, where the
current link reliability $p_i^l = 0.9$ for all $i$, resulting from
a 1-node/transmission scheme with a node reliability
also 0.9, and a supervisory coverage $c_i = 0.9$.

The following can be observed. (i) Without feedback,
the network reliability $\prod_{i=1}^{K-1} p_i^l$ depends solely on the
individual link reliabilities. Therefore, high link reliability
is crucial, especially for a route with a large number
of hops. Given the limited standalone node reliability
and channel fading phenomena, high link reliability is
not possible without using a multiple-node cooperative
transmission scheme. (ii) Feedback enables the network
to eventually settle in its absorbing state at the expense
of power and bandwidth expenditures. More specifically,
it takes an average of 1.23 transmissions to send a packet
to the next cluster in this example, which leads to less
power efficiency and more delay. In conclusion, it is
most desirable to have a supervisory scheme that is,
however, rarely called for under high coverage and high
link reliability conditions.
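The two cases can be reproduced numerically with a short sketch. It assumes a time-invariant chain with $u(i) = 1$ in which a packet advances from cluster $i$ with probability $p_i^l c_i$ and otherwise stays put; the function name `propagate` and the variable names are illustrative choices, not part of the original study.

```python
def propagate(p_link, c, K, steps):
    """State probabilities of a K-cluster chain with u(i) = 1.

    From cluster i < K the packet advances with probability p_link * c
    and stays otherwise; cluster K is absorbing.
    """
    a = p_link * c
    pi = [1.0] + [0.0] * (K - 1)           # packet starts in cluster 1
    history = [pi]
    for _ in range(steps):
        nxt = [0.0] * K
        for i in range(K - 1):
            nxt[i] += (1.0 - a) * pi[i]    # attempt (or command) failed
            nxt[i + 1] += a * pi[i]        # packet relayed to next cluster
        nxt[K - 1] += pi[K - 1]            # absorbing destination
        pi = nxt
        history.append(pi)
    return history


# Supervised single-node case: p = 0.9, c = 0.9.
p_sup, c_sup = 0.9, 0.9
n_per_hop = 1.0 / (p_sup * c_sup)           # N_i = 1 / (p_i c_i)
hist = propagate(p_sup, c_sup, K=5, steps=10)

# Unsupervised two-node case: link reliability 1 - (1 - 0.9)^2 = 0.99,
# no re-transmission, so the 4-hop route reliability is the product.
p_unsup = 1.0 - (1.0 - 0.9) ** 2
r_unsup = p_unsup ** 4

print(round(n_per_hop, 2), round(r_unsup, 4), round(hist[-1][-1], 4))
```

The value `n_per_hop` corresponds to the 1.23 transmissions per hop quoted above, and `hist[k][-1]` is the probability that the packet has reached cluster 5 after `k` transmission instants.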

Our remaining tasks have become obvious: to assess
and maximize link reliabilities, and to devise a
re-transmission stopping rule that abandons a route when
it becomes a liability to the network.

4.3.  Network  reliability 

4.3.1. Node and channel reliability models: Due to
dependence on power consumption, the time to failure
distribution of a node must be of increasing failure rate
(IFR), i.e., a node that is found to be good after some
usage must have a shorter residual life than a brand
new node. The Weibull IFR distribution

$$F^n(t) = 1 - r^n(t) = 1 - e^{-\left(\frac{t}{\theta(P(J))}\right)^{\beta(P(J))}} \qquad (11)$$

is used as an example in this section, where
$\beta(P(J)) > 1$ is called a shape parameter and $\theta(P(J)) > 0$
is called a characteristic life. The Weibull model is
deemed covariate because of the explicit dependence
of its parameters on power $P(J)$ joules/packet/node
involving a $J$-node cooperative transmission. For
simplicity $P(J)$ will be suppressed in the following
discussion. $t$ is now identified with the number of packets
the node has relayed. The characteristic life can be
scaled by $1/N_i$ to reflect the additional life expenditure
due to the need of re-transmission at the $i$th cluster.

For a given type of node and a family of distributions,
the parameters of the distribution can be determined
statistically (Casella, 2002). Suppose at a fixed power
level, an $n$-unit concurrent test is performed. The test
terminates at the arrival of the $r$th node failure, i.e., upon
the observation of failure times $\{t_1, \ldots, t_r\}$. The
maximum likelihood estimates of the Weibull parameters
can be solved from

$$\frac{r}{\beta} + \sum_{i=1}^{r} \log t_i - \frac{1}{\theta^{\beta}}\left[\sum_{i=1}^{r} t_i^{\beta}\log t_i + (n-r)\,t_r^{\beta}\log t_r\right] = 0,$$

$$-\frac{r}{\theta^{\beta}} + \frac{1}{\theta^{2\beta}}\left[\sum_{i=1}^{r} t_i^{\beta} + (n-r)\,t_r^{\beta}\right] = 0.$$

In addition, Mann’s two-parameter F-test can be
performed to determine whether to reject the hypothesized
Weibull with a specified significance level (Zacks, 1992).
The empirical dependence of $\beta$ and $\theta$ on $P(J)$ can
be established by repeating the experiments for many
power levels.
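A minimal numerical sketch of the estimation step follows, assuming the $(t/\theta)^{\beta}$ parameterization of (11) and a test stopped at the $r$th failure. Eliminating $\theta$ from the two likelihood equations leaves a single equation in $\beta$, which is solved here by bisection; the function name, the bracketing interval, and the synthetic test data are all assumptions made for illustration.

```python
import math


def weibull_mle_type2(failures, n, lo=0.01, hi=50.0, iters=200):
    """MLE of (beta, theta) from the first r failure times of n units on test."""
    t = sorted(failures)
    r = len(t)
    t_r = t[-1]                          # the test stops at the r-th failure
    sum_log = sum(math.log(x) for x in t)

    def pieces(beta):
        w = [(x / t_r) ** beta for x in t]            # scaled by t_r^beta
        s0 = sum(w) + (n - r)                          # failures + survivors
        s1 = (sum(wi * math.log(x) for wi, x in zip(w, t))
              + (n - r) * math.log(t_r))
        return s0, s1

    def score(beta):
        s0, s1 = pieces(beta)
        return r / beta + sum_log - r * s1 / s0        # theta eliminated

    for _ in range(iters):                             # bisection on beta
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    s0, _ = pieces(beta)
    theta = t_r * (s0 / r) ** (1.0 / beta)
    return beta, theta


# Synthetic data (assumption): exact Weibull quantiles for beta = 2,
# theta = 100, with the test on n = 200 units stopped at r = 150 failures.
n_units, r_fail, beta_true, theta_true = 200, 150, 2.0, 100.0
times = [theta_true * (-math.log(1.0 - (i - 0.5) / n_units)) ** (1.0 / beta_true)
         for i in range(1, r_fail + 1)]
beta_hat, theta_hat = weibull_mle_type2(times, n_units)
print(beta_hat, theta_hat)
```

Repeating the fit at several power levels would give the empirical dependence of the two parameters on $P(J)$ mentioned above.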

Let $T_{lc}$ denote the period of loop closure, indicating
how often a node is checked to determine whether
it has failed. Assuming a uniform aging process, the
residual life distribution $F_k(t) = P[T \le t \mid T > (k-1)T_{lc}]$
of a node follows

$$F_k(t) = 1 - \frac{r^n(t)}{r^n((k-1)T_{lc})}, \qquad t \ge (k-1)T_{lc}, \quad k = 1, 2, \ldots$$
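As an illustration of the residual life distribution under the Weibull model (11), the sketch below evaluates the probability that a node found good at the start of the $k$th checking period fails before the next check. The parameter values are arbitrary assumptions chosen only to show the IFR effect.

```python
import math


def weibull_reliability(t, beta, theta):
    """r(t) = exp(-(t / theta)^beta)."""
    return math.exp(-((t / theta) ** beta))


def residual_cdf(t, k, T_lc, beta, theta):
    """F_k(t) = 1 - r(t) / r((k-1) T_lc), valid for t >= (k-1) T_lc."""
    t0 = (k - 1) * T_lc
    if t < t0:
        raise ValueError("t must be at least (k-1)*T_lc")
    return 1.0 - (weibull_reliability(t, beta, theta)
                  / weibull_reliability(t0, beta, theta))


# Illustrative values only (assumptions): beta > 1 gives an IFR node.
beta, theta, T_lc = 2.0, 100.0, 10.0
fail_before_next_check = [residual_cdf(k * T_lc, k, T_lc, beta, theta)
                          for k in range(1, 6)]
print(fail_before_next_check)
```

With $\beta > 1$ the listed probabilities grow with $k$, which is the sense in which a used node has a shorter residual life than a new one.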


The single channel failure distribution is assumed to be
time independent, and identical for all channels in the
network, i.e., $r_i^c = r^c$, unless some a priori information
is available, which can be easily incorporated. The
randomness is associated with the fading phenomena
(Rappaport, 2002).


4.3.2. Link and network reliability: Suppose the $i$th
cluster of the $K$-cluster network contains a total of
$I_i$ nodes. Suppose for every sequence of $I_i$ requests of
packet transmission that arrive at the $i$th cluster, a node
responds to a set of $J_i$ consecutive requests. In such an
arrangement, which will be called a participating/non-participating
protocol hereafter, the burden of packet
transmission for every node is effectively reduced to a
fraction $J_i/I_i$, and the single node characteristic life
$\theta_i$ is increased effectively to $\theta_i I_i/J_i$. Note again that
the current age of a node is the number of packet
transmissions the node has carried out so far. This
protocol unifies the node ages across a cluster.

The reliability of the $K$-cluster network is now
considered. The example in Figure 4.3(a) depicts a
portion of an interconnection containing two nodes in
each cluster, where $S_j^i$ denotes the $j$th node in the $i$th
cluster, and $C_{j,k}^i$ denotes the channel linking the $j$th
node in the $i$th cluster to the $k$th node in the $(i+1)$th
cluster. The consideration of channel failures turns the
interconnection into a nested structure rather than a
cascade structure.


Figure 4.3 (a) Dependence diagram across hops $i-1$, $i$, and $i+1$, (b) simplification

The nested structure in Figure 4.3(a) can be
decomposed into logic stages for which the output signal
availability can be computed when conditioned on the
input signal availability using a combinatorial method.
More specifically, one may write for the $i$th hop in
Figure 4.3(a)

$$y_1^i = C_{11}^i S_1^i u_1^i + C_{21}^i S_2^i u_2^i, \qquad y_2^i = C_{12}^i S_1^i u_1^i + C_{22}^i S_2^i u_2^i,$$

for which sixteen conditional probabilities

$$P(y_1^i y_2^i = ab \mid u_1^i u_2^i = cd), \qquad a, b, c, d \in \{0, 1\},$$

can be computed, with a ‘1’ representing availability
of a signal and a ‘0’ unavailability. For example, with $t$
suppressed, it can be shown that

$$\begin{aligned}
P(y_1^i y_2^i = 00 \mid u_1^i u_2^i = 11) &= (1 - r_i^n)^2 + 2(1 - r^c)^2 r_i^n (1 - r_i^n) + (1 - r^c)^4 (r_i^n)^2, \\
P(y_1^i y_2^i = 01 \mid u_1^i u_2^i = 11) &= P(y_1^i y_2^i = 10 \mid u_1^i u_2^i = 11) \\
&= 2 r^c (1 - r^c)\, r_i^n (1 - r_i^n) + (2 r^c - (r^c)^2)(1 - r^c)^2 (r_i^n)^2, \\
P(y_1^i y_2^i = 11 \mid u_1^i u_2^i = 11) &= 2 (r^c)^2 r_i^n (1 - r_i^n) + (r^c)^2 (2 - r^c)^2 (r_i^n)^2.
\end{aligned} \qquad (12)$$

The four probabilities sum to one. The stages are linked by
$u_1^{i+1} = y_1^i$, $u_1^i = y_1^{i-1}$, $u_2^{i+1} = y_2^i$, and $u_2^i = y_2^{i-1}$.
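Conditional probabilities of this kind can be cross-checked by brute-force enumeration of the node and channel states of one stage. The sketch below does this for a 2-node stage with both inputs available, assuming independent failures, a common node reliability `r_n`, and a common channel reliability `r_c`; the function name and the numerical values are illustrative assumptions.

```python
from itertools import product


def stage_probabilities(r_n, r_c):
    """P(y1 y2 = ab | u1 u2 = 11) for a 2-node stage, by enumeration."""
    probs = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for s1, s2, c11, c12, c21, c22 in product((0, 1), repeat=6):
        weight = 1.0
        for node in (s1, s2):
            weight *= r_n if node else 1.0 - r_n
        for chan in (c11, c12, c21, c22):
            weight *= r_c if chan else 1.0 - r_c
        # Logic equations of the stage with inputs u1 = u2 = 1.
        y1 = (c11 and s1) or (c21 and s2)
        y2 = (c12 and s1) or (c22 and s2)
        probs[(int(bool(y1)), int(bool(y2)))] += weight
    return probs


r_n_demo, r_c_demo = 0.9, 0.95          # illustrative values only
table = stage_probabilities(r_n_demo, r_c_demo)
for outcome in sorted(table):
    print(outcome, round(table[outcome], 6))
```

The same enumeration extends to clusters with more nodes, although its cost grows exponentially with the number of nodes and channels, which is one motivation for the bounding model that follows.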

Extension of the above result from 2-node clusters
to $J_i$-node clusters is straightforward, and can be
carried out in a systematic manner. Nevertheless, reliability
evaluation of the nested structure is a major hurdle for
optimization, especially in real time. It is therefore
desirable to work with simpler network reliability models
that provide bounds on the nested network reliability.
For example, with a $k_i$-out-of-$J_i$ (Zacks, 1992)
requirement based on cooperative transmission considerations,
where $k_i$ is the required minimal number of operative
nodes and $J_i$ is the number of participating nodes in the
$i$th cluster, $R_{net}$ is bounded below by

$$\prod_{i=1}^{K} \sum_{r=k_i}^{J_i} \binom{J_i}{r} \left[(r^c)^{J_{i+1}} r_i^n\right]^{r} \left[1 - (r^c)^{J_{i+1}} r_i^n\right]^{J_i - r}, \qquad (13)$$

which comes from the decomposed cascade of
functional units as shown in Figure 4.3(b). Note that
$J_{K+1} = 0$ because no further transmission is needed
at cluster $K$. The lower bound is equivalent to the
configuration of Figure 4.3(a) in that signals initiated
from node $S_j^i$ can reach every participating node in
hop $i+1$ if and only if every channel $C_{j,k}^i$ is intact for
the given $i$, $j$, and for all $k \in \{1, 2, \ldots, J_{i+1}\}$. This
implies a channel reliability $(r^c)^{J_{i+1}}$. It is, however, not
necessary that every channel work to guarantee
the information flow through the network, hence the
conservativeness. Let $R_i$ be the probability that a
packet reaches at least $k_{i+1}$ nodes among the $J_{i+1}$
participating nodes in cluster $i+1$ with the required
power level, given that the packet is transmitted at
cluster $i$ from at least $k_i$ nodes among $J_i$ participating
nodes. Denote the $i$th term in the product in (13) by
$\underline{R}_i$. It can be shown that $0 \le R_i - \underline{R}_i \le 1 - (r^c)^{J_{i+1}}$.
The error bound is tight as long as channel reliability
is high, and the number of participating nodes in
cooperative transmissions is not excessively large. Many
of the analyses from this point on will use the lower
bound (13), including the definition of link reliability,
i.e., $p_i^l = \underline{R}_i$, and composite network reliability $R_{net} = p_1^l \times \cdots \times p_K^l$.
Now, the participating node allocation
problem becomes amenable to solutions using dynamic
programming (Bellman, 1957).
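The lower bound (13) is a product of $k$-out-of-$J$ reliabilities and is inexpensive to evaluate. The sketch below assumes the binomial form of each factor, a common node reliability `r_n` across clusters, and a common channel reliability `r_c`; the function names are hypothetical.

```python
from math import comb


def link_bound(J_i, k_i, J_next, r_c, r_n):
    """i-th factor of the bound: at least k_i of J_i units work,
    each unit working with probability q = r_c**J_next * r_n."""
    q = (r_c ** J_next) * r_n
    return sum(comb(J_i, r) * q ** r * (1.0 - q) ** (J_i - r)
               for r in range(k_i, J_i + 1))


def network_bound(J, k, r_c, r_n):
    """Product of the link bounds; J_{K+1} = 0 is appended for the last cluster."""
    J_ext = list(J) + [0]
    bound = 1.0
    for i in range(len(J)):
        bound *= link_bound(J_ext[i], k[i], J_ext[i + 1], r_c, r_n)
    return bound


# With perfect channels, a 2-node cluster needing 1 operative node of
# reliability 0.9 gives the link reliability 0.99 used in Subsection 4.2.
single_link = link_bound(J_i=2, k_i=1, J_next=0, r_c=1.0, r_n=0.9)
route = network_bound(J=[2, 2, 2], k=[1, 1, 1], r_c=0.98, r_n=0.9)
print(single_link, route)
```

Because each factor depends only on $J_i$ and $J_{i+1}$, the bound couples neighboring clusters only, which is what makes a stage-by-stage optimization possible.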

4.4.  Optimization  and  control 

This  subsection  discusses  two  applications  of  the 
derived  link  reliabilities. 

4.4.1. Participating node allocation: Our task is to
determine the values of $J_1, J_2, \ldots, J_K$ so that the
network reliability is the largest at $T_D$ without violating a
bandwidth constraint. In cluster $i$, $J_{i,min}$ is imposed by
the particular transmission scheme, while $J_{i,max} \le I_i$ is
mainly imposed by the available bandwidth. Bounding
model (13) converts the network level decision into a
series of coupled cluster level decisions. In this case,
channel failures introduce only local coupling, which can
be resolved by an ordered selection process starting
from $J_K$ at the last cluster and ending at $J_1$. The
solution $\{J_1^*, \ldots, J_K^*\}$ can then be inserted into the
staged conditional probability formulae (12) to calculate
the true network reliability.

To illustrate the basic idea, consider a 3-cluster
network with 10 nodes in each cluster. A tree structure
shown in Figure 4.4 can be created to represent all
possible solutions at $T_D$, where all branches violating
the constraints have been trimmed. Constraints particular
to the cooperative transmission scheme (Li and
Wu, 2003) are $\sum_{i=1}^{3} J_i = 12$ and $J_{i,min} = 2$. Each
joint of the tree at a given cluster index represents
a possible cumulative number of nodes. Each branch
leading to the joint carries a cost equal to $\underline{R}_i(T_D)$ for
a particular $J_i$. The accumulated reliability for each
passage from the root to a leaf can be computed using
Bellman’s principle of optimality (Bellman, 1957). The
principle is applied at every unit index $i$ by comparing
all the accumulated reliabilities leading to the same
joint. Only the solution of the highest reliability is
retained at each joint, and the rest are removed. Once
the set $\{J_1^*, J_2^*, \ldots, J_K^*\}$ is obtained, the link reliabilities
are set to $p_i^l = \underline{R}_i$, $i = 1, 2, \ldots, K-1$. Suppose the unit
reliabilities $\underline{R}_i(T_D)$ for $J_i = 2$ through $J_i = 8$ have been found
to be 0.85, 0.90, 0.95, 0.99, 0.995, 0.999, and 0.9995,
respectively, for the network in Figure 4.4. The optimal
node allocation derived using dynamic programming is then
$J_i = 4$ for $i = 1, 2, 3$.
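The trimming procedure can be sketched as a small dynamic program over cumulative node counts. The sketch assumes that the seven listed reliabilities correspond to $J_i = 2, \ldots, 8$, that the same table applies to every cluster, and that the total number of participating nodes must equal 12; the function name `allocate_nodes` is hypothetical.

```python
def allocate_nodes(unit_rel, n_clusters, total):
    """Maximize the product of unit reliabilities subject to sum(J_i) == total.

    best[s] holds the highest accumulated reliability (and its allocation)
    among all partial solutions whose cumulative node count is s, i.e. one
    surviving branch per joint of the tree.
    """
    best = {0: (1.0, [])}
    for _ in range(n_clusters):
        nxt = {}
        for s, (rel, alloc) in best.items():
            for J, R in unit_rel.items():
                s2 = s + J
                if s2 > total:
                    continue                      # branch violates the constraint
                cand = (rel * R, alloc + [J])
                if s2 not in nxt or cand[0] > nxt[s2][0]:
                    nxt[s2] = cand                # keep only the best at each joint
        best = nxt
    return best[total]


# Assumed mapping of the quoted reliabilities to J_i = 2, ..., 8.
unit_rel = {2: 0.85, 3: 0.90, 4: 0.95, 5: 0.99, 6: 0.995, 7: 0.999, 8: 0.9995}
best_rel, best_alloc = allocate_nodes(unit_rel, n_clusters=3, total=12)
print(best_alloc, best_rel)
```

Under these assumptions the program returns four nodes per cluster, in agreement with the allocation stated above.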


Figure 4.4 Trellis diagram for node allocation

Note that unit reliability is a complex function of
$J_i$, which is determined by the methods discussed
in subsections 4.3.1 and 4.3.2. The optimization in this
subsection is carried out under the assumption that the
network is operating unsupervised. It is possible to
re-optimize the network reliability projected at the
network design life when a supervisor exists that can
report the actual rather than the predicted status of the
nodes. A commonly used idea called receding horizon
optimal control in the control literature (Mayne, 1990)
can be applied in this case. Though only limited data
exchange is required to carry out dynamic programming,
the main challenge with real time optimization in a
distributed environment is that data exchange is not
only expensive but unreliable. How frequently such a
partial reorganization should be performed is currently
under investigation.

4.4.2.  Re-transmission  control:  In  this  subsection, 
the  re-transmission  chain  of  subsection  2  is  revisited. 
It  is  now  assumed  that  the  network  is  supervised  to 
the  extent  that  it  can  detect  a  cluster  transmission 
failure  but  not  the  state  of  the  nodes  and  channels,  and 
the  participating/nonparticipating  protocol  is  effective 
to  manage  the  large  number  of  nodes  available  at 
each  cluster.  The  decision  regarding  re-transmission  in 
each  of  the  clusters  upon  the  detection  of  a  cluster 
transmission  failure  can  be  made  based  on  the  solution 
of a Markov decision problem. The main purpose is to
be able to terminate the service of the K-cluster route
so that it does not turn into a black hole in the network.

Let $x_k \in \{1, 2, \ldots, K\}$ denote the random state
variable at $t = k$ in the chain. Control action $u(x_k) = 1$
(or 0) indicates the network's decision to (or not
to) re-transmit a packet. Let $C(x_k, u_k)$ be the cost
incurred when control action $u_k$ is taken based on
$x_k$. Our goal is to determine a re-transmission policy
$\pi$ that minimizes the total expected cost
$V_\pi(x_0 = i) = E_\pi \sum_{k=0}^{\infty} C(x_k, u_k)$. It has been shown that under
the condition $0 < C(j, u) < \infty$ for all $j$ and all $u$
that belong to some finite admissible sets $U_j$, $j = 1, 2, \ldots, K-1$,
the minimal cost $V^*(i)$ satisfies the following
optimality equation (Cassandras and Lafortune, 1999)

$$V(i) = \min_{u \in U_i} \Big\{ C(i,u) + \sum_{j=1}^{K} p_{i,j}(u)\, V(j) \Big\}.$$

In addition, policy $\pi^*$ is optimal if and only if it yields
$V^*(i)$ for all $i$.

Referring to the Markov chain in Figure 4.1, the
optimality equation can be specialized to the following
form.

$$V(i) = \min_{u \in U_i} \Big\{ u(i) T_i + [1 - u(i)] L_i + u(i) \big[ p_{i,i}(u(i)) V(i) + p_{i,i+1}(u(i)) V(i+1) \big] \Big\} \qquad (14)$$

where $p_{i,i} = 1 - p_i c_i$, $p_{i,i+1} = p_i c_i$, where network
age $t$ is suppressed, $T_i$ is the power and bandwidth
cost incurred when the network chooses to re-transmit
a packet, and $L_i$ is the packet loss cost incurred when
the network chooses not to re-transmit. (14) can be
expressed as

$$V(i) = \min\{ T_i + p_{i,i} V(i) + p_{i,i+1} V(i+1),\; L_i \}.$$

To gain some insight into the optimal policy, assume
$T_i = T$, $L_i = L$, $p_{i,i+1} = r$, and $p_{i,i} = 1 - r$ for
$i = 1, \ldots, K-1$. Since

$$(1 - r)V(j) + rV(j+1) \le (1 - r)V(i) + rV(i+1)$$

as long as $j > i$, the optimal policy is of the
threshold type (Cassandras and Lafortune, 1999) with
some threshold $i^*$, i.e.,

$$V(i) = \begin{cases} T/r + V(i+1), & i \ge i^* \quad (u(i) = 1) \\ L, & i < i^* \quad (u(i) = 0) \end{cases}$$


Given that $V(K) = 0$, $V(i)$ can be solved

$$V(i) = \begin{cases} (K - i)\,T/r, & i \ge i^* \quad (u(i) = 1) \\ L, & i < i^* \quad (u(i) = 0) \end{cases}$$

from which the threshold is obtained

$$i^* = \left\lceil K - \frac{rL}{T} \right\rceil .$$

$\lceil \cdot \rceil$ denotes the smallest nonnegative integer greater
than $K - rL/T$. It can be seen that the optimal policy
favors a re-transmission when a packet is near the end
of the K-cluster route (large $i$), when a cluster is young
(large $r$), when the cost of a packet loss is large (large
$L$), when power and bandwidth are cheap (small $T$), and
when a route is short (small $K$). A study without the
simplifying assumptions along this direction is ongoing.
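As a numerical check on the threshold policy, the following Python sketch solves the simplified optimality equation by backward recursion from $V(K) = 0$ and compares the first cluster at which re-transmission is chosen with the closed-form threshold. The function names and the parameter values in the usage line are illustrative assumptions, not values from the study.

```python
import math


def solve_retransmission(K, r, L, T):
    """Backward recursion for V(i) = min{T/r + V(i+1), L} with V(K) = 0.

    Returns dicts V (indexed 1..K) and u (indexed 1..K-1).
    """
    V = {K: 0.0}
    u = {}
    for i in range(K - 1, 0, -1):
        retransmit = T / r + V[i + 1]  # cost of re-trying until the hop succeeds
        if retransmit < L:
            V[i], u[i] = retransmit, 1
        else:
            V[i], u[i] = L, 0          # cheaper to accept the packet loss
    return V, u


def threshold(K, r, L, T):
    """Smallest nonnegative integer strictly greater than K - r*L/T."""
    return max(0, math.floor(K - r * L / T) + 1)


if __name__ == "__main__":
    # Hypothetical values: 10 clusters, r = 0.9, loss cost 5, attempt cost 1.
    V, u = solve_retransmission(10, 0.9, 5.0, 1.0)
    print(threshold(10, 0.9, 5.0, 1.0), u)
```

The recursion uses the fact that, when re-transmission is chosen, the fixed point of $V(i) = T + (1-r)V(i) + rV(i+1)$ is $V(i) = T/r + V(i+1)$.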

4.5.  Section  summary 

In this section, fault-tolerance of a K-hop wireless
network with a cooperative transmission scheme is
measured by the network reliability projected at its
design life. The network is subject to both channel
fading and node failures. A node lifetime is modeled
by a Weibull-like covariate model where both the
characteristic life and the shape parameters are dictated
by the node power consumption. Assessment of network
reliability is accomplished by the composition of staged
conditional probabilities. Upper and lower bounds on
the network reliability are obtained that disentangle
the nested interconnection of nodes and channels into
a chain of links with decoupled link reliabilities.

An analysis framework has been established in this
section for qualifying fault-tolerance in wireless
networks that age more rapidly as their node power is
consumed. Improving fault-tolerance is shown to require
a high supervisory coverage with a frequency of loop
closure as low as affordable by the high link reliability.
A more thorough investigation is ongoing on the effect
of the loop closure frequency on the network
fault-tolerance.

5. Effect of Acknowledgement on Performance of a Fault-Tolerant Wireless Network

5.1. Problem description

This section studies the same K-hop, single-route wireless
network, shown in Figure 5.1, as that considered in [47],
but with several realistic constraints included. The
objective is to quantify the effect of loop closure frequency
and the nodes' storage capacity on the performance of the
network in terms of network lifetime and packet loss rate.
The constraints considered are each node's finite packet
processing time, each node's life expenditure while performing
supervisory activities, and a specific re-transmission
protocol. As a result of the added complexity, it becomes
necessary to resort to numerical means in order to achieve
our objective. A discrete event simulation tool called Arena
[35] is used for this purpose.

Background  information  summarizing  the  conditions 


Hop i-1    Hop i    Hop i+1


Figure 5.1 Dependence diagram of 3 clusters of nodes and
channels in a K-cluster wireless network

The example in Figure 5.1 depicts a portion of an
interconnection containing two nodes in each cluster, where
$S_j^i$ denotes the $j$th node in the $i$th cluster, and $C_{jk}^i$ denotes the
channel linking the $j$th node in the $i$th cluster to the $k$th node in
the $(i+1)$th cluster. Modeling channel failures turns the
interconnection into a nested structure rather than a cascade
structure. This model is used in [47] to understand the effect
of the level of cooperation in transmission on network
reliability at its design life (see footnote 5), as well as the
benefit and cost of feedback.

Each  node  in  Figure  5.1  has  a  limited  energy  supply  that 
cannot  be  replenished,  and  is  capable  of  transmitting  and 
receiving  symbols  in  packets,  and  processing  signals,  which 
involves  detection  and  fusion,  coding  and  decoding, 
modulation  and  demodulation,  as  well  as  channel  estimation. 
The  restriction  to  the  single-route  problem  is  entirely  due  to 
the  intention  to  capitalize  on  some  new  physical  layer 
transmission  schemes  [30],  [28],  [29]  rather  than  on  the 
routing  protocols. 


5 The network's design life is defined as the maximum time by which a
prescribed reliability of the network is maintained.


In  [47],  lifetimes  of  the  nodes  are  modeled  with  power- 
covariate  distributions  of  increasing  failure  rates.  A  method 
for  assessing  both  link  and  network  reliabilities  projected  at 
the  network’s  design  life  is  developed.  The  link  reliability  is 
then  used  to  allocate  active  nodes  to  clusters  through 
dynamic  programming  to  maximize  the  network’s  fault- 
tolerance,  and  to  establish  a  re-transmission  control  policy 
that  minimizes  the  expected  cost  involving  power,  bandwidth 
expenditures,  and  packet  loss. 

A Markov chain model is established in [47], as shown in
Figure 5.2, to capture the high-level effect of feedback,
where state name $i$ stands for the $i$th cluster within which a
packet hopping from the source through the network to the
destination is residing.

[Transition labels: $p_1^t c_1 u(1)$, $p_2^t c_2 u(2)$, ..., $p_{K-1}^t c_{K-1} u(K-1)$]

Figure 5.2 Packet transmission in a K-cluster route

$c_i$, called a supervisory coverage, is a conditional
probability that upon the failure of a first transmission
attempt to the $(i+1)$th cluster, a re-transmission command is
successfully issued to cluster $i$. The factors affecting $c_i$
include lack of state observability and the fading channel from
the receiving node/cluster to the transmitting node/cluster. It
captures decision risks in supervisory activities. $u(i)$ is the
re-transmission control action when state $i$ is entered. A
re-transmission is attempted when $u(i) = 1$. This chain is
non-homogeneous due to the dependence of link reliability $p_i^t$ on
cluster age. The link reliability is the probability that a
packet reaching the $i$th cluster is successfully relayed to the
$(i+1)$th cluster with a required power level, and depends on
both channel and node reliabilities.

The following can be observed from the Markov model. (i)
Without supervisory activities, the network reliability
depends solely on the product of all link reliabilities.
Therefore, high link reliability is crucial, especially for a
route with a large number of hops. Given the limited
standalone node reliability and channel fading phenomena,
high link reliability is not possible without using a clustered
cooperative transmission scheme. (ii) Feedback enables
packets to eventually reach the destination at the expense of
power and bandwidth expenditures, and time delays.
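The two observations can be made concrete with a short Python sketch. It is a simplified illustration only: it assumes time-invariant link reliabilities, and for the feedback case it assumes every failed hop is always re-tried with per-attempt success probability $p_i c_i$. The function names are hypothetical.

```python
import math


def unsupervised_reliability(link_reliabilities):
    """Probability that a packet crosses every link when there is no feedback."""
    return math.prod(link_reliabilities)


def expected_transmissions(link_reliabilities, coverages):
    """Expected number of attempts when each failed hop is always re-tried.

    Assumes each attempt on link i succeeds independently with probability
    p_i * c_i, so the attempt count per hop is geometrically distributed.
    """
    return sum(1.0 / (p * c) for p, c in zip(link_reliabilities, coverages))


if __name__ == "__main__":
    for hops in (5, 20):
        print(hops, unsupervised_reliability([0.99] * hops))
    print(expected_transmissions([0.9] * 5, [0.95] * 5))
```

The first function shows how quickly end-to-end reliability decays with the number of hops; the second shows the price paid in transmissions when feedback is used to guarantee eventual delivery.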

To determine whether to retransmit in case of a transmission
failure, [47] assumes that the network is supervised to the
extent that it can detect a cluster transmission failure but not
the state of the nodes and channels. The decision regarding
re-transmission in each of the clusters upon the detection of a
cluster transmission failure is made based on the solution of
a Markov decision problem. The main purpose is to be able
to terminate the service of the K-cluster route so that it does
not turn into a black hole blocking the traffic of other routes
that intersect it.

This section will investigate the dependence of network
lifetime and packet loss rate on the frequency of
acknowledgements from the receiving nodes, and on the
storage capacity of the nodes for both arriving packets and
the copies of transmitted packets in case of a transmission
failure. The purpose of acknowledgement is to terminate the
activity of the K-cluster route so that the lives of operative
nodes are saved for other uses. No re-transmission of
packets is considered in this section. On the other hand, the
following simplifying assumptions used in [47] are
eliminated in this section: the processing time of packets within
a node is zero; supervisory activities incur no power
expenditures; and the supervisory scheme has no association with
any particular protocol. Eliminating these simplifying
assumptions complicates our task so dramatically that it forces
us to resort to numerical means for performance evaluation.

The section is organized as follows. Subsection 5.2 describes
a particular feedback protocol implemented in Arena for the
next cluster to acknowledge the receipt of a certain number
of packets. The subsection also highlights modeling of the 5-hop
network of Figure 5.1 with Arena. Subsection 5.3 defines a set
of performance measures that include network lifetime,
packet loss rate, and false alarm rate. It then details the analyses
of performance based on the simulation output. Subsection 5.4
discusses the implications on network design based on our
simulation study, limitations of our work, and areas to be
addressed in the future.

5.2.  Acknowledgement  protocol  and  network  modeling  with 
ARENA 

Referring to Figure 5.1, each receiving node at the next
cluster transmits an acknowledgement (ACK) to the
transmitting nodes in the previous cluster after receiving a
sequence of N packets; this is regarded as feedback.
Therefore, the loop closure period is N. If, for example, a
transmitting node, after transmitting a string of DL packets,
where DL > N, does not receive an acknowledgement (ACK)
from the receiving node, it assumes that all receiving nodes
in the cluster have failed. In this event, the transmitting node
simply stops sending packets, and the network life ends. The
supervisory coverage [42] in this case is the conditional
probability that upon the reception of N packets, an ACK
command issued by one of the receiving nodes successfully
reaches one of the working transmitting nodes. In a network
with redundant nodes, it is not necessary that every channel
work to guarantee the information flow through the
network.

In spite of an analytic approach proposed in [47], the
evaluation of network reliability for the nested structure is
tedious and is a major hurdle for optimization. The
additional constraints described in the previous subsection
completely rule out the possibility of an analytical approach
to performance analysis. For this reason, the model in
Figure 5.1 is constructed with Arena [35]. Arena is a
general-purpose simulation tool for discrete event systems.
Though it does not have the network-oriented convenience
afforded by specialized tools for networks, it offers us the
flexibility to model in detail many aspects of networks
that have not been studied by others.

The  model  of  the  wireless  network  shown  in  Figure  5.3  is 
constructed  using  different  modules  in  Arena  that  are 
arranged  into  a  number  of  templates  such  as  ‘Basic  Process’, 
‘Advanced  Process’  and  ‘Advanced  Transfer’  [23].  The 
Basic  Process  template  contains  modules  that  are  used  in 
modeling  packet  arrival  and  packet  departure,  assigning 
attributes  to  packets,  channel  random  fading,  and  node 
processing.  The  Advanced  Process  panel  comprises  specific 
logical  functions  such  as  a  fictitious  control  logic  unit  that  is 
used  to  match  the  incoming  packets  depending  on  their 
attributes,  duplicate,  and  merge  packets.  Finally,  the  Route 
module  in  Advanced  Transfer  template  is  used  to  transfer  the 
packets  to  specified  stations.  Independent  replications  are 
performed  for  each  simulation  of  the  wireless  system  model 
and  the  simulation  results  are  stored  and  reported. 

In the 5-hop wireless network of Figure 5.3, the data packets
are generated at the source of the network according to a Poisson
process. They pass through channels and nodes, and are
delivered to the sink. For simplicity, assume that the source and
sink do not fail over time. There is no transport time for
passing through channels. The processing time at each node
is fixed at T units of time per packet. A packet can be lost
through a faded channel, in a failed receiving node, in a
failed transmitting node, or due to a collision. The channels
have independent failure probabilities. The nodes have
failure times that follow independent Weibull distributions.
The Weibull distribution allows a more truthful description of a
node's life in our application: a node that is found to
be good after some usage will have a shorter residual life
than a brand new node, whereas a used node found to be
good is indistinguishable from a new one if it is modeled by
an exponential lifetime distribution.
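The aging argument can be checked numerically. The Python sketch below compares the probability of surviving a further stretch of use for a fresh node and for a used node, under a Weibull survival function $\exp(-(t/\beta)^\alpha)$. The shape and scale values mirror those quoted later in this section, but the ages chosen and the function names are illustrative assumptions.

```python
import math


def weibull_survival(t, shape, scale):
    """P(lifetime > t) for a Weibull lifetime; shape = 1 gives the exponential."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(age, extra, shape, scale):
    """P(lifetime > age + extra | lifetime > age)."""
    return (weibull_survival(age + extra, shape, scale)
            / weibull_survival(age, shape, scale))


if __name__ == "__main__":
    for shape in (3.0, 1.0):
        fresh = conditional_survival(0.0, 20.0, shape, 50.0)
        used = conditional_survival(30.0, 20.0, shape, 50.0)
        print(shape, fresh, used)
```

With shape 3 the used node is noticeably less likely to survive the extra usage than the fresh one, while with shape 1 (the exponential case) the two probabilities coincide, which is the memoryless behavior the text refers to.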


Hop i-1    Hop i    Hop i+1


The major source of randomness for the wireless network is
linked to node failures. Suppose $S_{B_1}, S_{B_2}, \ldots$ are the busy
time data of a node and $F_{S_B}$ is the distribution fitted to
these data. Then the node works until its total
accumulated busy (processing) time reaches a value $s_{B_1}$, at
which point the busy node fails.

Each receiving node at the next hop transmits an
acknowledgement (ACK) to a transmitting node of the
previous hop after receiving a sequence of N packets. N is
referred to as the loop closure period. This transmission of an
acknowledgement to the previous node acts as feedback.
The transmitting node of the previous cluster stops
transmitting packets to the next node if it does not receive
the acknowledgement within a certain deadline (DL). This
deadline lets the transmitting node wait for an
acknowledgement beyond the loop closure period, in case
some packets are lost, before declaring that a receiving node has
failed. To illustrate how the acknowledgement
process is handled, consider the case in which the packets numbered
1, 2, 3, ..., N have been transmitted. The transmitting node
waits for an acknowledgement from the receiving nodes either until
it receives one, after which it resets its
counter, or until the deadline, at which point it
discontinues sending packets and declares the end of
network life. A false alarm is said to have occurred if all
transmitting nodes cease to transmit packets at the end of the
deadline even though some receiving nodes are still
operative.
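The counter logic of this protocol can be sketched as follows. This is a simplified single-transmitter, single-receiver illustration in Python, not the Arena model: it assumes the deadline is counted in packets sent since the last ACK, that an ACK, once issued, always reaches the transmitter, and that the hypothetical input `received_flags` records whether each packet reached a working receiver.

```python
def run_transmitter(received_flags, n_ack, deadline):
    """Return (packets_sent, stopped_by_deadline).

    received_flags[k] is True when the (k+1)-th packet reaches a live receiver.
    n_ack is the loop closure period N; deadline is DL, with DL > N.
    """
    sent_since_ack = 0       # transmitter-side counter
    received_since_ack = 0   # receiver-side counter
    for sent, ok in enumerate(received_flags, start=1):
        sent_since_ack += 1
        if ok:
            received_since_ack += 1
            if received_since_ack == n_ack:
                # Receiver issues an ACK; transmitter resets its counter.
                received_since_ack = 0
                sent_since_ack = 0
        if sent_since_ack >= deadline:
            # Deadline passed without an ACK: declare end of network life.
            return sent, True
    return len(received_flags), False


if __name__ == "__main__":
    healthy = [True] * 100
    receiver_dies = [True] * 10 + [False] * 100
    print(run_transmitter(healthy, 5, 25))
    print(run_transmitter(receiver_dies, 5, 25))
```

In the second scenario the receiver stops responding after ten packets, so the transmitter sends another DL packets before giving up; this is the post-failure energy expenditure that a shorter deadline would reduce, at the price of more false alarms when packets are merely lost.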


5.3.  Performance  analysis  via  simulation 

This subsection investigates the dependence of network
performance on the frequency of acknowledgements from the
receiving nodes and on the storage capacity of the nodes.
Two aspects of storage capacity are considered. In a
transmitting node that is responsible for re-transmission, a
copy of a transmitted packet is stored for as long as the set
deadline for the reception of an acknowledgement of that
packet. Since this section does not deal with re-transmission,
only a counter is needed. In a receiving node, a received
packet may be allowed to wait in a buffer for its turn to be
processed while a previously received packet is
being processed. Performance measures include time to
network failure, packet loss rate, and false alarm rate. The
wireless network is simulated both with and without
feedback. The importance of selecting an appropriate
deadline, buffer size, and loop closure period to the network
performance is delineated.

Designing and analyzing a simulation experiment depends on
the type of simulation [27]. The performance measures of
interest in this study necessitate terminating simulations. The
terminating condition for the wireless network is met
when the network fails, which occurs when packets can no longer
pass from source to sink. This could be due to
the failure of all nodes in a cluster, or due to a false alarm
that occurs when an acknowledgment deadline is passed
even though there are still surviving nodes in the receiving
cluster.


For a given scenario, n independent replications of a
terminating simulation are run, where each replication is
terminated as soon as a network failure is declared and
begins with the same initial condition of an empty and fully
operative network. The behavior of the network is studied
based on apposite data collected in the course of simulation,
and the performance measures of interest are estimated using
the data. As the number of collected data samples n increases,
that is as $n \to \infty$, the sample mean of the performance
estimates from the multiple independent replications
converges almost surely to the true mean of the underlying
distribution of the performance measure, based on the Strong
Law of Large Numbers [8].

The packet inter-arrival time is exponentially distributed
with mean time varied at 0.1, 0.25 and 0.35 sec. The service
time of each node is fixed at T = 0.02 sec. The channels are
chosen to have an independent failure probability of 0.01. The
failure time distribution of a node is Weibull, whose shape
parameter $\alpha$ and scale parameter $\beta$ can be determined
statistically if the failure data of L concurrent tests are
available [47]. The mean of the Weibull distribution is given by
$(\beta/\alpha)\,\Gamma(1/\alpha)$, where $\Gamma$ is the complete gamma function. For
chosen parameter values of $\beta = 50$, $\alpha = 3$ in the 5-hop
wireless network, the mean failure of the node occurs after
serving about 2233 packets.
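The packet count quoted above can be reproduced with a short calculation. The Python sketch below assumes, as our reading of the text, that the Weibull lifetime is measured in seconds of accumulated busy time, so that dividing the mean lifetime by the fixed service time gives the mean number of packets served; the function name is illustrative.

```python
import math


def weibull_mean(shape, scale):
    """Mean of a Weibull distribution: (scale / shape) * Gamma(1 / shape)."""
    return (scale / shape) * math.gamma(1.0 / shape)


if __name__ == "__main__":
    mean_busy_time = weibull_mean(3.0, 50.0)   # seconds of processing
    mean_packets = mean_busy_time / 0.02       # 0.02 sec of service per packet
    print(mean_busy_time, mean_packets)
```

The result is a mean busy time of roughly 44.6 sec, or a little over 2232 packets, in line with the figure of about 2233 packets given in the text.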

Time to network failure (TNF) is defined as the expected
number of packets received at the sink before the network
fails. Using simulation data, the time to network failure is
estimated as the average, over multiple replications, of the
total number of packets received at the sink before the
terminating condition occurs. Suppose $N_i^R$ is
the total number of packets that reach the sink when the
simulation terminates for the $i$th single run of the network.
The simulation is run $n$ times with the same initial
conditions. Data $N_1^R, N_2^R, \ldots, N_n^R$ are collected and are used
to obtain the time to network failure as a point estimate of
the form

$$\widehat{TNF}_n = \frac{1}{n} \sum_{i=1}^{n} N_i^R .$$

The $(1 - \alpha)$ confidence interval of the estimate is given by
$[\widehat{TNF}_n - t_{n-1,\alpha/2}\sqrt{S_n^2/n},\; \widehat{TNF}_n + t_{n-1,\alpha/2}\sqrt{S_n^2/n}]$, where

$$\frac{\widehat{TNF}_n - E\{TNF\}}{\sqrt{S_n^2/n}}$$

follows a Student's t distribution with n-1 degrees of freedom,
and the sample variance is given by

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( N_i^R - \widehat{TNF}_n \right)^2 .$$

This method is called the method of independent replications
[1].

Consider the case when the feedback is applied to the
wireless network, with N = 5, DL = 25, inter-arrival time
exponentially distributed with mean 0.1 sec, and the rest of
the specifications unchanged. Then, with n = 30, the
time to network failure is estimated as

$$\widehat{TNF}_{30} = \frac{1}{30} \sum_{i=1}^{30} N_i^R = 884.5 \text{ packets},$$

and the half width of the 95% confidence interval is determined
as

$$t_{n-1,\alpha/2}\sqrt{S_n^2/n} = 105.29 .$$

Thus, the 95% confidence interval for TNF is 779.21 < TNF
< 989.79.
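The method of independent replications amounts to a sample mean, a sample variance, and a Student's t critical value. The Python sketch below shows the computation on made-up data. The replication counts are hypothetical, the critical value is passed in by the caller (for example, about 2.776 for 4 degrees of freedom and about 2.045 for 29 degrees of freedom at the 95% level), and the function name is an assumption. The same estimator applies to any statistic collected once per replication.

```python
import math
import statistics


def replication_estimate(samples, t_crit):
    """Return (point_estimate, half_width) from independent replications.

    t_crit is the Student's t critical value t_{n-1, alpha/2}.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    s2 = statistics.variance(samples)        # unbiased, divides by n - 1
    half_width = t_crit * math.sqrt(s2 / n)
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical packet counts at the sink from five replications.
    counts = [850.0, 910.0, 790.0, 1020.0, 860.0]
    mean, half = replication_estimate(counts, 2.776)
    print(mean - half, mean, mean + half)

    # Consistency check on the interval reported in the text.
    print(884.5 - 105.29, 884.5 + 105.29)
```

The last line simply confirms that the reported point estimate and half width reproduce the interval bounds 779.21 and 989.79.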


[Plots: (a) Time to Network Failure vs Loop Closure Period, service time = 0.02 sec;
(b) Time to Network Failure vs Deadline, loop closure period = 5, service time = 0.02 sec]

Figure 5.4 Time to network failure with respect to (a)
loop closure period, (b) acknowledgment deadline


To ensure that the selected number of simulation runs is
sufficiently large for the central limit theorem to apply, a
simple test can be performed to confirm whether the data
resemble a normal distribution. If not, more replications are
required.

Simulation analysis of the 5-hop network inclusive of the
feedback mechanism indicates, as shown in Figure 5.4, that
the time to network failure increases with increasing period
of loop closure (N) for a given deadline, and with increasing
deadline for a given loop closure period. The former is
attributed to the increased life expenditure associated with
more frequent supervisory activities that consume extra
power. The latter has to do with the reduced false alarm rate as
the deadline increases. A transmitting node assumes that a
receiving node has failed if it does not receive an
acknowledgement from the receiving node within a deadline.
Hence, the false alarm rate increases with a shorter deadline.

In several types of wireless networks, accurate and complete
reception of packets is of paramount importance; hence,
packet loss rate is a significant performance measure. To
estimate the packet loss rate, two sets of data are collected.
The first set collected from the $i$th replication is $(N_i^S - N_i^R)$,
which is the difference between the total number of packets
created in the source and that received in the sink in the $i$th
replication by the time the termination condition is met. The
second set collected from the $i$th replication is $N_i^S$, the total
packets generated in the source by the time the termination
condition is met. The sample function is defined as

$$\frac{N_i^S - N_i^R}{N_i^S},$$

which represents the percentage loss with respect to total
packets generated in the $i$th replication. The point estimate of
packet loss rate can be obtained as

$$\widehat{PLR} = \frac{1}{n} \sum_{i=1}^{n} \frac{N_i^S - N_i^R}{N_i^S} .$$

[Plots: Packet Loss Rate vs Deadline, loop closure frequency = 5, service time = 0.02 sec;
Packet Loss Rate vs Loop Closure Period, service time = 0.02 sec]

Figure 5.5 Packet loss rate with respect to (a) loop
closure period and (b) deadline

From Figure 5.5, it is evident that the packet loss rate
decreases with increasing loop closure period and with increasing
deadline for all inter-arrival rates. This is because a larger
loop closure period implies less node power expenditure in
supervisory activities and hence more likely success in
transmission, and an extended deadline implies a lower
false alarm rate and hence an effectively longer network
life. As the inter-arrival time increases, the PLR decreases,
because the chance of packet collision decreases.

Also of interest is the estimate of false alarm rate (FAR), for
which the sample function is defined in terms of an indicator
function as follows

$$I_i = \begin{cases} 1, & \text{if a false alarm occurs in the } i\text{th replication} \\ 0, & \text{otherwise.} \end{cases}$$

Recall that a false alarm occurs when a transmitting node
does not receive an acknowledgement from any of the receiving
nodes at the time of a deadline while some of the receiving
nodes are still alive. Then the point estimate for the false
alarm rate is defined as

$$\widehat{FAR} = \frac{1}{n} \sum_{i=1}^{n} I_i .$$

[Plot: False alarm rate vs Deadline, loop closure period = 5, service time = 0.02]

Figure 5.6 False alarm rate with respect to deadline

The false alarm rate decreases with increasing deadline, as
shown in Figure 5.6, up to a point beyond which further
extension of the deadline brings no more benefit because the
network fails almost surely before a false alarm occurs.

The packet loss rate (PLR) is also examined against the
buffer size at a node. For the given range of arrival rate, a
buffer size of 7 is determined to be sufficient, as shown in
Figure 5.7. The reduction in packet loss rate when a
sufficiently large buffer is in place is mainly attributed to the
effective avoidance of collision.

[Plot: Packet loss rate with respect to buffer, inter-arrival = expo(0.1) sec, service time = 0.02 sec]

Figure 5.7 Packet loss rate with respect to loop
closure period with buffer size as parameter.


5.4. Section summary

In this section, some aspects of performance of a K-hop
wireless network subject to both channel fading and node
failures are evaluated. The goal is to understand the effect of
supervisory activities, in particular the acknowledgement of
reception of packets, on the number of packets the
battery-powered network can transmit and on the probability of
success of transmission (1 - packet loss rate).


In the following, the implications of our analysis for network
design are discussed, along with some future work.

The need for acknowledgment arises when there is a benefit
for the nodes to know when to stop transmitting. Introducing
acknowledgment, however, always incurs a cost in terms of
both network lifetime and transmission success.

The only way to reduce the impact of the acknowledgement
on network lifetime is to keep a sufficiently low loop closure
rate that minimizes a combined measure of energy
expenditure before and after network failure. The energy
expenditure after the network failure is attributed to the
additional power consumption in operative nodes due to the
delay in applying a stopping rule.

On the other hand, a sufficiently large buffer size can
effectively reduce the chance of packet collision, which in
turn reduces packet loss rate. In our study, however, the
utilization of individual nodes is low. The very high packet
loss rate observed is therefore largely attributed to the lack
of reliability of the nodes and the channels. The multiple-hop
environment accentuates the effect of unreliability.
Therefore, a higher level of redundancy becomes necessary.

The matter becomes much more complicated if re-transmission
of packets is considered. In that case a lower loop closure
rate implies a longer delay for a packet to pass through the
network. Therefore, one must consider trading off delay
against the extra power consumption and bandwidth contention
that come with a higher loop closure rate. In this case,
however, as long as the storage capacity is sufficient,
packet loss can be reduced to practically none in the early
life of the network. Again, an optimal level of redundancy
should be sought with respect to all competing interests to
prolong the network life. A protocol for re-transmission is
conceivably more complex than the protocol for
acknowledgement. Therefore, we propose to consider
re-transmission in future work, in which network lifetime
will be evaluated against energy efficiency, bandwidth
efficiency, and time delay, and supervisory coverage will be
estimated for a given protocol.
6. Supervisory Control of a Database Unit


6.1.  Problem  description 

The recent effort to install and test monitoring tools and
to increase the level of redundancy in critical subsystems
in air operation centers [45] has provided opportunities for
vast performance improvement in their command and control
(C2 hereafter) supporting systems. Our previous work on a
controlled C2 processing unit [44] has demonstrated that
reduced response time to service requests and shortened
periods of system unavailability, as a result of automated
monitoring and control, can significantly raise the
probability of attaining the desired outcome in an air
operation. This section shifts focus to another critical C2
subsystem, a database unit. A simulation study [49] has been
performed recently using Arena [35], [23] on a controlled
database unit. The results indicate, however, that the
architecture shown in Figure 6.1 is extremely inefficient:
the service burden rests almost entirely on the primary
server, while the secondary server, though indispensable for
the required system availability, is rarely utilized.

Figure  6.2  shows  an  alternative  architecture  for  which  the 
potential  improvements  in  response  time  and  in  service 
availability  are  to  be  examined.  The  partition  of  the  database 
into  multiple  sets  of  data  (to  be  called  data  classes  hereafter), 
and  the  simultaneous  access  to  multiple  servers  allow  the 
reduction  of  the  response  time  to  queries,  whereas  the 
presence  of  a  secondary  data  class  in  every  server  leads  to 
fault-tolerance  and  therefore  higher  service  availability.  The 
performance  improvement,  however,  cannot  be  achieved  in  a 
cost-effective  manner  without  a  reconfiguration  scheme 
called  a  supervisory  control  that  acts  on  the  state  information 
of  the  database  system.  This  effort  investigates  several  such 
schemes  that  differ  by  their  control  authorities.  To  assess  the 
effectiveness  of  these  schemes  in  a  quantified  manner,  the 
model  in  Figure  6.2  (and  that  in  Figure  6.1)  is  given  the 
interpretation  of  a  queuing  network  [39]  with  specific  sets  of 
operating  policies  and  structural  parameters.  The  control 
authorities  considered  include  the  ability  to  restore  the  lost 
data  and/or  the  ability  to  route  queries.  In  order  to  obtain  an 
analytic  model  of  manageable  size  for  scrutinizing  the  effects 
of  supervisory  control,  the  archiving  process  is  ignored,  and 
the  queuing  network  is  of  the  closed  type  [8].  A  simulation 
study  is  being  conducted  currently  without  these 
simplifications. 

The section is organized as follows. Section 6.2 models
the database system in Figure 6.2 as a Markov chain [22]
with supervisory control. Section 6.3 evaluates a set of
performance measures under several supervisory control
policies. Section 6.4 concludes the section. Details of the
database model are given in the Appendix.


Figure  6.1  Redundant  database  unit 


Figure  6.2  Partitioned  database  unit 


6.2.  Modeling  and  control 
6.2.1.  Modeling 

The database unit in Figure 6.2 contains three servers in
parallel to answer three classes (A, B, C) of queries for
which relevant information can be found in the partitioned
sets A, B, C of the database, respectively. Server $S_{AB}$
contains database class A as the primary class and database
class B as the secondary class. Server $S_{BC}$ contains
database class B as the primary class and database class C as
the secondary class. Server $S_{CA}$ contains database class C
as the primary class and database class A as the secondary
class. The failure of a server implies the loss of both
classes of data within the server. A system level failure is
declared when two servers fail, in which case one class of
data is said to be lost. The queues preceding servers
$S_{AB}$, $S_{BC}$, and $S_{CA}$ are named $Q_{AB}$, $Q_{BC}$,
and $Q_{CA}$, respectively. All queues are of sufficient
capacity. Service is provided on a first-come, first-served
(FCFS) basis at each server.

The  three  delay  elements  imply  that  there  are  always  three 
customers  present  in  the  unit  at  any  given  time.  A  new  query 
is  generated  at  a  delay  element  upon  the  completion  of  the 
service  to  a  query  at  one  of  the  servers.  The  delay  elements 
are  intended  to  be  also  reflective  of  the  response  time  to  the 
querying  customers  by  other  service  nodes  in  the  C2 
supporting  system,  which  are  not  explicitly  modeled.  Any 
new query is assumed to be equally likely to seek database
class A or B or C. Therefore routing probabilities $p_{AB}$,
$p_{BC}$, and $p_{CA}$ are assigned the same values under the
normal operating condition.

The use of a queuing network model for the database is
based on its suitability for incorporating control actions
and our intention to capture their effects on the system
performance. The model is built in this study with the
premise that event life distributions have been established
for the process of query generation
($\exp(\lambda) = 1 - e^{-\lambda t}$), the process of service
completion ($\exp(\mu)$), the process of server failure
($\exp(\nu)$), the process of data restoration
($\exp(\gamma)$), and the process of unit overhaul
($\exp(\omega)$) when the failed database unit is repaired.
All such processes are independent. Standard statistical
methods that involve data collection, parameter estimation,
and goodness of fit tests [53] exist for identifying event
life distributions. Since all event lives are assumed to be
exponentially distributed, the database unit can be
conveniently modeled as a Markov chain specified by a state
space $\mathcal{X}$, an initial state probability mass
function (pmf) $\pi(0)$, and a set of state transition rates
$\Lambda$ [8], [22]. The reader uninterested in the details of
model building can advance to the paragraph right above
Equation (15).

6.2.1.1. State space $\mathcal{X}$

A state name is coded with a 6-digit number indicative of
all queue lengths and server states in the unit. With some
abuse of notation, a valid state representation is given by
$x = Q_{AB}Q_{BC}Q_{CA}S_{AB}S_{BC}S_{CA}$, where queue lengths
$Q_{AB}, Q_{BC}, Q_{CA} \in \{0, 1, 2, 3\}$ with total length
$L = Q_{AB} + Q_{BC} + Q_{CA} \le 3$, and server states
$S_{AB}, S_{BC}, S_{CA} \in \{0, 1, 2\}$. Server state “2” =
data are lost in both the primary and the secondary classes
in a server, “1” = the data in the primary class have been
restored and data in the secondary class have not been
restored, and “0” = data in both primary class and secondary
class in a server are intact. A server is said to be in the
down state if it is at either state “1” or state “2”. For
example, state 110020 indicates that server $S_{AB}$ is up
with one customer in its queue, server $S_{BC}$ is down with
both classes of data gone and one customer in its queue, and
server $S_{CA}$ is up and idle. Note that the queue length
includes the customer being served. There are 540 valid
states in the system. The total number of states is reduced
to 147 when the states of system level failures are
aggregated. The symmetry of the system permits the
arrangement of customers in the queues at the time of system
level failure to be captured in one of seven states, allowing
the system to return to an equivalent state upon completion
of the system overhaul. A set of alternative state names is
assigned from $\mathcal{X} = \{1, 2, \ldots, 147\}$ with 000000
mapped to $x = 1$ and the aggregated system failure states
mapped to $x \in \{141, 142, 143, 144, 145, 146, 147\}$.
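The state counts quoted above can be checked by direct enumeration. The following is a minimal sketch, assuming only the encoding rules stated in this subsection; the function and constant names are illustrative and do not come from the original model code, and the number of aggregated failure states (seven) is taken from the text rather than derived.

```python
from itertools import product

MAX_CUSTOMERS = 3          # three delay elements, so at most three queued queries
SERVER_STATES = (0, 1, 2)  # 0 = intact, 1 = primary restored only, 2 = all data lost


def enumerate_states():
    """Return every valid tuple (Q_AB, Q_BC, Q_CA, S_AB, S_BC, S_CA)."""
    states = []
    for queues in product(range(MAX_CUSTOMERS + 1), repeat=3):
        if sum(queues) > MAX_CUSTOMERS:
            continue  # total queue length L must not exceed 3
        for servers in product(SERVER_STATES, repeat=3):
            states.append(queues + servers)
    return states


def is_system_failure(state):
    """System level failure: two or more servers are down (state 1 or 2)."""
    return sum(1 for s in state[3:] if s != 0) >= 2


all_states = enumerate_states()
operational = [s for s in all_states if not is_system_failure(s)]
AGGREGATED_FAILURE_STATES = 7  # count quoted in the text, not derived here

print(len(all_states))                               # 540
print(len(operational) + AGGREGATED_FAILURE_STATES)  # 147
print((1, 1, 0, 0, 2, 0) in operational)             # True: the example state 110020
```

The 140 operational states arise from 20 admissible queue arrangements times 7 server configurations with at most one server down.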

6.2.1.2. Initial state pmf $\{\pi_x(0),\ x = 1, 2, \ldots, 147\}$

It is assumed that the database unit starts operation from
state $x = 1$, i.e., the initial state probability is given by
vector $\pi(0) = [1\ 0\ \cdots\ 0]$. When overhaul is considered
at the occurrence of a system level failure, the system
returns to a state with an equivalent arrangement of
customers in the queues once the database unit is renewed
[22] and ready for operation again.


6.2.1.3. Set of state transition rates $\Lambda$

A transition rate table containing all transition rates is
created following a procedure similar to that described in
[42], but with a more compact representation. The state
transition table is given in the Appendix. The list of
current states occupies the first column of the table. In the
row corresponding to each state, the set of all feasible next
states is listed, with each next state followed by the rate
at which the next state is reached. Events that trigger the
transitions and the corresponding transition rates are given
as follows. A newly generated query enters one of the servers
with rate $p_{u_2}(3 - L)\lambda$, where $p_{u_2}$ is a routing
probability controlled by control variable $u_2$. A query is
answered at a server with rate $\mu$. A complete data loss
occurs at a server with rate $\nu$. Data in the primary data
class of a server are restored with rate $\gamma_p u_1$, where
$u_1$ authorizes whether to restore the lost data. Data in the
secondary data class of a server are restored with rate
$\gamma_s u_1$. Finally, the failed database unit is renewed
with rate $\omega u_3$, where $u_3$ decides whether to repair
the failed system. All rates are relative, for their net
effects depend on the time unit specified.

Let $X(t) \in \mathcal{X}$ denote the random state variable at
time $t$. The set of state transition functions

$$p_{ij}(t) \equiv P[X(t) = j \mid X(0) = i], \quad i, j = 1, 2, \ldots, 147 \qquad (15)$$

for the continuous-time Markov chain can be solved from the
forward Chapman-Kolmogorov equation [8]

$$\dot{P}(t) = P(t)Q, \quad P(0) = I, \quad P(t) = [p_{ij}(t)], \qquad (16)$$

where $Q$ is called an infinitesimal generator or a rate
transition matrix whose $(i, j)$ entry is given by the rate
associated with the transition from current state $i$ to next
state $j$ in the rate transition table. The state probability
mass function at time $t$

$$\pi(t) = [\pi_1(t)\ \ \pi_2(t)\ \ \cdots\ \ \pi_{147}(t)], \quad t \ge 0 \qquad (17)$$

is computed by

$$\pi(t) = \pi(0)P(t). \qquad (18)$$

At this point a Markov model for the database unit of
Figure 6.2 has been established. The state probabilities are
the basis for evaluating the performance of the database unit,
which is conducted in Section 6.3.
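To illustrate how the state probabilities follow from a generator matrix, the sketch below evaluates $\pi(t) = \pi(0)e^{Qt}$ by uniformization, a standard numerical technique that is not necessarily the one used in this study. The three-state chain is a hypothetical toy aggregate (all servers up, one server down, system failed), not the 147-state model, and it assumes a single restoration stage with rate $\gamma$; the rate values are borrowed from the figure annotations later in this section.

```python
import math


def transient_pmf(pi0, Q, t, tol=1e-12, max_terms=100_000):
    """Approximate pi(t) = pi(0) exp(Q t) by uniformization."""
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n)) or 1.0  # uniformization rate
    # One-step matrix of the uniformized chain: U = I + Q / rate.
    U = [[(i == j) + Q[i][j] / rate for j in range(n)] for i in range(n)]
    weight = math.exp(-rate * t)   # Poisson probability of zero jumps in [0, t]
    accumulated = weight
    vec = list(pi0)                # holds pi(0) U^k, starting with k = 0
    result = [weight * v for v in vec]
    for k in range(1, max_terms):
        if 1.0 - accumulated <= tol:
            break                  # remaining Poisson mass is negligible
        vec = [sum(vec[i] * U[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        accumulated += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result


# Hypothetical toy chain: state 0 = all up, 1 = one server down, 2 = system failed.
nu, gamma = 0.005, 0.05
Q_toy = [[-3 * nu, 3 * nu, 0.0],
         [gamma, -(gamma + 2 * nu), 2 * nu],
         [0.0, 0.0, 0.0]]
pi_100 = transient_pmf([1.0, 0.0, 0.0], Q_toy, t=100.0)
print(pi_100, sum(pi_100))
```

The series is truncated once the accumulated Poisson weight is within `tol` of one, so the sketch is intended for moderate values of the product of the uniformization rate and $t$.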

6.2.2.  Control  policies 

Our  ultimate  goal  is  to  eliminate  all  single  point  failures, 
and  to  mitigate  the  effects  of  a  single  server  failure  on  the 
performance  of  the  database  unit.  Our  approach  is  to  base  the 
supervisory  control  actions  on  the  state  information,  which 
effectively  alter  the  transition  rates  when  loss  of  data  occurs 
in  a  single  server. 

Taking into consideration the symmetry of the model, the
control policy is described only for the case of a failed
server $S_{AB}$. When routing control is effective, the
routing probabilities are determined by the state of $S_{AB}$
and by whether the lost data can be restored. Thus,
$p_{u_2} = p_{AB}(S_{AB}, u_1),\ p_{BC}(S_{AB}, u_1),\ p_{CA}(S_{AB}, u_1)$
with $p_{AB} + p_{BC} + p_{CA} = 1$. The control policies
considered for this study are summarized as follows.

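$$u_1 = \begin{cases} 0, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{serves (no restoration)} \\ 1, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{restores class A data} \\ 1, & S_{AB} = 1,\ S_{CA}\ \text{serves},\ S_{BC}\ \text{restores class B data} \end{cases} \qquad (19)$$

$$u_2 = \begin{cases} 0, & S_{AB} = 2,\ p_{AB} = p_{BC} = p_{CA} = \tfrac{1}{3} \\ 1, & S_{AB} = 2,\ p_{AB}(2, u_1),\ p_{BC}(2, u_1),\ p_{CA}(2, u_1) \\ 1, & S_{AB} = 1,\ p_{AB}(1, u_1),\ p_{BC}(1, u_1),\ p_{CA}(1, u_1) \end{cases} \qquad (20)$$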


Four sets of routing probabilities are shown in the
following table as examples, where $S_{BC} = 0$ and
$S_{CA} = 0$ are assumed.


Table  6.1  Examples  of  routing  probabilities 


u1    u2    SAB      pAB          pBC          pCA
0     1     2        0            1/2          1/2
1     0     2 (1)    1/3 (1/3)    1/3 (1/3)    1/3 (1/3)
1     1     2 (1)    0 (1/6)      2/3 (1/6)    1/3 (2/3)
1     1     2 (1)    0 (0)        1 (0)        0 (1)

The composition of $u_1$ and $u_2$ gives rise to four different
control policies. The case of $(u_1, u_2) = (0, 0)$ corresponds
to the case of a single point failure, and is therefore not
considered in the performance analysis. The control policies
in the other three cases are named

Policy 1: $(u_1, u_2) = (0, 1)$ when a server is down,

Policy 2: $(u_1, u_2) = (1, 0)$ when a server is down,  (21)

Policy 3: $(u_1, u_2) = (1, 1)$ when a server is down.

Note  that  policy  2  does  not  permit  routing,  whereas  policy 
1  does  not  permit  restoring.  As  can  be  seen,  policy  3  allows 
variations in the routing probabilities to the intact servers.
A special consideration with the case $u_1 = 0$ is that
customers who arrived at a server before it failed are
rerouted to the delay elements.
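The routing rules for a failed server $S_{AB}$ can be expressed as a small lookup, sketched below. The entries are copied from the first three rows of Table 6.1 (with $S_{BC} = S_{CA} = 0$); the dictionary layout, the function names, and the use of a seeded random generator to draw a destination are illustrative assumptions rather than part of the original model.

```python
import random
from fractions import Fraction as F

SERVERS = ("S_AB", "S_BC", "S_CA")
UNIFORM = (F(1, 3), F(1, 3), F(1, 3))

# Keys are (u1, u2, S_AB); values are (p_AB, p_BC, p_CA) as listed in Table 6.1.
ROUTING = {
    (0, 1, 2): (F(0), F(1, 2), F(1, 2)),     # Policy 1: routing only
    (1, 0, 2): UNIFORM,                      # Policy 2: restoration only
    (1, 0, 1): UNIFORM,
    (1, 1, 2): (F(0), F(2, 3), F(1, 3)),     # Policy 3: restoration and routing
    (1, 1, 1): (F(1, 6), F(1, 6), F(2, 3)),
}


def routing_probabilities(u1, u2, s_ab):
    """Return (p_AB, p_BC, p_CA); routing is uniform while S_AB is intact."""
    if s_ab == 0:
        return UNIFORM
    return ROUTING[(u1, u2, s_ab)]


def sample_route(u1, u2, s_ab, rng):
    """Draw the destination server of a newly generated query."""
    probs = routing_probabilities(u1, u2, s_ab)
    return rng.choices(SERVERS, weights=[float(p) for p in probs])[0]


rng = random.Random(0)
print(routing_probabilities(1, 1, 1))
print(sample_route(0, 1, 2, rng))  # never "S_AB" under Policy 1 with S_AB = 2
```

Exact fractions are used so that the constraint $p_{AB} + p_{BC} + p_{CA} = 1$ can be checked without rounding error.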

The presence of supervisory control in the transition rate
table is seen via $u_1$, $u_2$, $u_3$, $\bar{u}_1 = 1 - u_1$,
$\bar{u}_2 = 1 - u_2$, and $\bar{u}_3 = 1 - u_3$. The values of
$u_1$, $u_2$, $u_3$ represent specific control actions
associated with data restoration, query routing, and unit
overhaul, respectively.

6.3.  Performance  analysis 

6.3.1.  Time  to  system  failure 

When $u_3 = 0$, the Markov chain model for the database
unit contains seven absorbing states
$x \in \{141, 142, 143, 144, 145, 146, 147\}$, in which the
chain remains forever once one of them is entered. These are
the states of system level failure. The remaining 140 states
are transient states. Decompose the state probability vector

$$\pi(t) = [\,\underbrace{\pi_T(t)}_{1 \times 140}\ \ \underbrace{\pi_A(t)}_{1 \times 7}\,], \qquad (22)$$

where vector $\pi_T(t)$ contains the transient state
probabilities, and $\pi_A(t)$ contains the absorbing state
probabilities. Decomposing the rate transition matrix $Q$ and
the state transition function matrix $P(t)$ solved from (16)
accordingly yields

$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ 0 & 0 \end{bmatrix}, \quad P(t) = \begin{bmatrix} P_{11}(t) & P_{12}(t) \\ 0 & I \end{bmatrix}. \qquad (23)$$

From (16), (18), and (23), it can be determined that the
probability density function of time to system failure, or
time to absorption, is given by

$$f(t) = \pi_T(0) P_{11}(t) Q_{12}, \quad \pi_A(0) = 0, \qquad (24)$$

where

$$\pi_T(0) = [1\ 0\ \cdots\ 0], \quad P_{11}(t) = e^{Q_{11} t}. \qquad (25)$$

In addition, the mean time to failure of the database unit can
be shown to be [22]

$$\mathrm{MTTF} = -\pi_T(0)\, Q_{11}^{-1}\, \mathbf{1}^T, \quad \mathbf{1} = [1\ \cdots\ 1]. \qquad (26)$$

Figure  6.3  below  shows  the  dependence  of  mean  time  to 
failure  of  the  database  unit  on  the  restoration  rate. 
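As a worked illustration of Equation (26), the sketch below computes the mean time to absorption for a hypothetical two-transient-state aggregate (all servers up, one server down) instead of the 140 transient states of the actual model. It assumes a single restoration stage with rate $\gamma$ and failure rates $3\nu$ and $2\nu$ from the two transient states; the helper names are illustrative. For this toy chain the result has the closed form $(\gamma + 5\nu)/(6\nu^2)$, which the code reproduces.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def mean_time_to_failure(pi_t0, Q11):
    """MTTF = -pi_T(0) Q11^{-1} 1^T, obtained by solving Q11 m = -1."""
    m = solve(Q11, [-1.0] * len(Q11))  # m[i] = mean absorption time from state i
    return sum(p * mi for p, mi in zip(pi_t0, m))


def toy_mttf(nu, gamma):
    """Toy aggregate: state 0 = all servers up, state 1 = one server down."""
    Q11 = [[-3 * nu, 3 * nu],
           [gamma, -(gamma + 2 * nu)]]
    return mean_time_to_failure([1.0, 0.0], Q11)


for gamma in (0.0, 0.05, 0.5):
    print(gamma, toy_mttf(0.005, gamma))
```

Even in this reduced setting the mean time to failure grows linearly with the restoration rate, which is the qualitative dependence examined in Figure 6.3.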


Parameters: $\lambda = 6$, $\mu = 12$, $\nu = 0.005$, $\omega = 0.01$


Figure 6.3 Database unit mean time to failure versus restoration rate

6.3.2. Steady-state availability

Suppose that an overhaul process starts as soon as the
database unit reaches a system level failure, that the unit
is repaired with rate $\omega$, and that the unit immediately
starts to operate again at the completion of the repair. In
this case $u_3$ is set to 1 in the model, whereas it is set to
0 in the case of an absorbing chain. The existence of a unique
steady-state distribution of the Markov chain when $u_3 = 1$
is guaranteed if the chain is irreducible (or ergodic) [22].
Ergodicity is satisfied under policy 2 and policy 3. Although
ergodicity is not met under policy 1 unless the few
unreachable states are eliminated, a unique steady-state
distribution is nevertheless obtained in our computation. The
steady-state availability, which can be roughly thought of as
the fraction of time the database unit is up, is given by

$$A_v = 1 - \pi_F(\infty), \qquad (27)$$

where $\pi_F(\infty)$ is the sum of the system level failure
state probabilities, determined by solving

$$\pi(\infty) Q = 0 \quad \text{and} \quad \sum_{x} \pi_x(\infty) = 1. \qquad (28)$$

Figure  6.4  shows  the  steady-state  availability  as  a  function 
of  restoration  rate  at  a  fixed  overhaul  rate.  Figure  6.6 
demonstrates  the  benefit  of  the  success  in  supervisory  control 
to  steady-state  availability. 
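Equations (27) and (28) can be illustrated on a small irreducible chain. The sketch below is a minimal example under stated assumptions: the chain is a hypothetical three-state aggregate (all up, one server down, system failed with overhaul rate $\omega$), not the 147-state model, and the balance equations are solved by fixed-point iteration on the uniformized chain, which is one of several possible numerical methods and is not claimed to be the one used in the study.

```python
def steady_state(Q, tol=1e-13, max_iter=1_000_000):
    """Solve pi Q = 0 with sum(pi) = 1 by iterating pi <- pi (I + Q / rate)."""
    n = len(Q)
    rate = 1.1 * max(-Q[i][i] for i in range(n))  # strictly above every exit rate
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [pi[j] + sum(pi[i] * Q[i][j] for i in range(n)) / rate
               for j in range(n)]
        total = sum(nxt)
        nxt = [p / total for p in nxt]            # renormalize against round-off
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def toy_availability(nu, gamma, omega):
    """States: 0 = all up, 1 = one server down, 2 = system failed (overhaul)."""
    Q = [[-3 * nu, 3 * nu, 0.0],
         [gamma, -(gamma + 2 * nu), 2 * nu],
         [omega, 0.0, -omega]]
    pi = steady_state(Q)
    return 1.0 - pi[2]                            # A_v = 1 - pi_F(inf)


print(toy_availability(nu=0.005, gamma=0.05, omega=0.01))
```

For these rates the balance equations of the toy chain give $\pi(\infty) = [2/3,\ 1/6,\ 1/6]$, so the toy availability is $5/6$.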
Parameters: $\lambda = 6$, $\mu = 12$, $\nu = 0.005$, $\omega = 0.01$


Figure  6.4  Steady-state  availability  of  the  database  unit  versus 
restoration  rate 


6.3.3.  Response  time 

The average response time $E[R]$ is the expectation of the
ratio of the total amount of time that all customers spend in
the upper portion of the system to the number of customers
that are serviced. A loose argument is given below to justify
the way $E[R]$ is computed in this subsection. Define the
vector $C$, where $C(i)$ is the number of customers in the
system at state $i$. Over a long time interval of length $t$,
the numerator of $E[R]$ is then $\pi(\infty)Ct$. Computing the
number of customers that are serviced requires counting the
transitions from one state to another that introduce a new
customer to the system. Define a matrix $N$ such that
$N(i, j)$ is equal to the number of customers introduced into
the system when the system transitions from state $i$ to state
$j$. The total number of such transitions for a given $i$ and
$j$ is then

$$T(i, j) = t\, N(i, j)\, \pi_i(\infty)\, Q(i, j). \qquad (29)$$

Therefore, the average response time $E[R]$ of the system is
taken as

$$E[R] = \frac{\pi(\infty) C t}{\sum_{i=1}^{147} \sum_{j=1}^{147} T(i, j)} = \frac{\pi(\infty) C}{\sum_{i=1}^{147} \sum_{j=1}^{147} N(i, j)\, \pi_i(\infty)\, Q(i, j)}. \qquad (30)$$

Figure  6.5a  and  Figure  6.6  show  the  average  response 
time  as  a  function  of  restoration  rate  with  the  overhaul  rate 
fixed,  and  a  function  of  overhaul  rate  with  the  restoration 
rate  fixed,  respectively,  for  all  three  policies.  The  routing 
probabilities  in  rows  1  through  3  in  Table  6.1  are  in  fact  used 
for  calculating  all  performance  measures  resulting  from 
Policies  1  through  3,  respectively.  Policy  1  enjoys  a  lower 
response  time  because  the  intact  servers  need  not  deny 
customers in order to restore the failed server. Also,
customers present at the time of server failure in policy 1
are emptied into the delay elements and accrue no additional
response time.
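The computation in Equation (30) can be illustrated on a much smaller closed network. The sketch below assumes a hypothetical single-server system with three customers, each returning from a delay element at rate $\lambda$ and served at rate $\mu$; this is a simplification chosen for the example, not the three-server unit of Figure 6.2. The steady-state probabilities follow from the birth-death balance equations, and the function name and matrix layout are illustrative.

```python
def average_response_time(pi, C, N, Q):
    """E[R] = pi(inf) C / sum_ij N(i, j) pi_i(inf) Q(i, j), as in Equation (30)."""
    n = len(pi)
    customers = sum(pi[i] * C[i] for i in range(n))          # pi(inf) C
    arrivals = sum(N[i][j] * pi[i] * Q[i][j]
                   for i in range(n) for j in range(n))      # arrival throughput
    return customers / arrivals


lam, mu, K = 6.0, 12.0, 3   # delay rate, service rate, number of customers

# State k = number of customers at the server (queue plus service), k = 0..K.
weights = [1.0]
for k in range(K):
    weights.append(weights[-1] * (K - k) * lam / mu)         # birth-death balance
pi = [w / sum(weights) for w in weights]

size = K + 1
Q = [[0.0] * size for _ in range(size)]
N = [[0] * size for _ in range(size)]
for k in range(size):
    if k < K:
        Q[k][k + 1] = (K - k) * lam   # one of the K - k delayed customers arrives
        N[k][k + 1] = 1               # this transition introduces one customer
    if k > 0:
        Q[k][k - 1] = mu              # service completion
    Q[k][k] = -sum(Q[k])              # diagonal makes each row sum to zero

C = list(range(size))
print(average_response_time(pi, C, N, Q))   # approximately 0.15 for these rates
```

Because every arrival is eventually served, the denominator equals the service throughput $\mu(1 - \pi_0)$, so the ratio agrees with Little's law for this toy system.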


(a) 


(b) 


Figure  6.5  Average  query  response  time  versus  restoration  rate 


Figure  6.5b  shows  the  effect  of  applying  Policy  3*:  routing 
all  customers  to  the  intact  server  that  is  not  restoring  the 
failed  server,  an  alternative  to  Policy  3.  The  reduced 
response  time  in  policy  3*  results  from  customers  not  waiting 
at  a  failed  server.  This  policy  may  not  be  as  advantageous  in 
a  system  of  higher  traffic  intensity. 


Parameters: $\lambda = 6$, $\mu = 12$, $\nu = 0.005$, $\gamma = 0.05$


Figure 6.6 Average query response time versus overhaul rate

6.3.4. Overhead

Overhead is a quantity introduced to reflect the ratio of the
time invested in helping the database unit survive longer to
its overall busy time. It is a measure of the cost of
supervisory control. More specifically,

$$O = \frac{\Pr[S_{AB}\ \text{restores or fails} \mid \text{unit is not failed}]}{\Pr[S_{AB}\ \text{restores or fails or serves} \mid \text{unit is not failed}]}. \qquad (31)$$

Overhead $O$ is calculated for both the absorbing chain
($u_3 = 0$) as a function of time, and the irreducible chain
($u_3 = 1$) as a function of server failure rate. These are
shown in Figure 6.7 and Figure 6.8. In Figure 6.7, it is seen
that restoration incurs a higher overhead in the early life of
the unit. As the database unit ages, its server becomes more
likely to fail. A control policy that permits restoration
becomes advantageous. There is a reduction in overhead across
all policies with an increase in the arrival rate because of
the resulting increased utilization. In Figure 6.8, for
sufficiently low server failure rate, overhead is always lower
with restoration. When the server failure rate passes some
threshold, however, restoration becomes expensive. Overhead is
expected to gain more significance as a function of time and
a function of server failure rate when the server life
distributions have an increasing failure rate, such as in the
case of a Weibull distribution.


Absorbing system ($\mu = 12$, $\nu = 0.005$, $\gamma = 0.05$, $\omega = 0$); panels (a) $\lambda = 6$, (b) $\lambda = 50$


Figure  6.7  Overhead  versus  time  in  the  absorbing  chain 


Irreducible system ($\mu = 12$, $\gamma = 0.05$, $\omega = 0.01$); panels (a) $\lambda = 6$, (b) $\lambda = 50$


Figure 6.8 Overhead versus failure rate in the irreducible chain

6.4. Section summary

This section modeled a redundant database unit in C2 for
investigation of fault-tolerance and responsiveness afforded
by a set of supervisory control policies. In all the
performance measures examined, restoration ($u_1$) is more
effective than routing ($u_2$). It is expected that when the
number of queries increases, or the traffic becomes more
intensive, the effectiveness of routing will be more apparent.

The study presented in this section is limited by our ability
to deal with complex problems analytically. Most restrictive
is the size of the state space. The closed queuing network
model shown in Figure 6.2 presents perhaps the smallest
possible state space for which the investigation of control
policies is nontrivial. Besides answering queries, the
database unit also must be updated from time to time. In that
case, two types of service requests exist and the state space
must be expanded. Almost equally restrictive is the
assumption that times to event occurrence are exponentially
distributed. Since there is only one parameter in an
exponential distribution, it is unlikely to describe some of
the processes faithfully. Discrete event simulations are
being carried out where the simplifying assumptions are
removed to substantiate our claims on the benefit of
supervisory control under more general settings in terms of
the types of services, the number of customers, and the types
of distributions of event lives.

Also ongoing is the extension of this study to incorporate
the effect of decision and control under uncertainty and time
delay due to, for example, incomplete state information and
the time required for state estimation, respectively. The
results will be reported in a future section.




7. A Simulation Study of the Effect of Supervisory
Control on a Redundant Database Unit

7.1. Problem description

The focus of this work is on studying the effect of
supervisory control [8] on a number of important
measures that pertain to C2 system performance with a
redundant architecture first proposed and studied in [51],
where a database is partitioned in a way that allows multiple
servers to process customers in parallel with information
backed up throughout the system. The proposed architecture
is shown in Figure 7.1, where the data are partitioned into the
sets A, B, and C. Customers entering the system are routed
based on the type of information they require.

To enhance fault-tolerance in the face of crash and site
failure, and improve the responsiveness to queries,
supervisory control is applied to the partitioned database
unit. The response time and availability can be potentially
improved by strategically routing customers based on the
state of the servers. Supervisory control introduces policies
that allow the restoration of lost data and/or the routing of
queries based on the state of the information in the system.

The objectives of this work are to qualitatively analyze the
performance of the partitioned database unit under
supervisory control and varying structural parameters, in
terms of MTTF, availability, response time, and overhead,
based on the results obtained through discrete event system
(DES) simulation.


Figure 7.1. Partitioned database unit


7.2. Background

The presentation in this section is drawn from [51] to
recapitulate aspects of modeling, control, and performance
analysis of the database unit shown in Figure 7.2 [51], to
identify the limitations of the analytical method employed
there, and to briefly describe the extensions made in this
section.


Figure 7.2. Closed queuing network model


7.2.1. System Model

The database unit to be studied is taken from [51], and
is intended to be representative of a C2 supporting system. A
closed queuing network representation of the unit is shown in
Figure 7.2. The information contained within the system is
partitioned into sets A, B, and C and placed on three servers
that exist in parallel to answer three classes of queries A, B,
and C, respectively. Server $S_{AB}$ contains class A primary
data and class B secondary data. Server $S_{BC}$ contains
class B primary data and class C secondary data. Server
$S_{CA}$ contains class C primary data and class A secondary
data. When a server fails, both its primary and secondary data
are lost. A server is “down” when either class of data is
lost, and a system failure occurs when two servers are down
concurrently.

The queues preceding $S_{AB}$, $S_{BC}$, and $S_{CA}$ are named
$Q_{AB}$, $Q_{BC}$, and $Q_{CA}$, respectively. They are of
sufficient size that no queries are lost or blocked, and they
operate on a first come, first served (FCFS) basis.

The delay elements, each labeled $\lambda$ to indicate an
average delay $1/\lambda$, are representative of the response
times incurred at other nodes of the C2 supporting system
which are not modeled here. The three elements imply that
there are only three customers in the system at any given
time, a limitation of the Markov model in [51] that is to be
removed in this study. Upon completion of processing at a
server, a customer returns to one of its delay elements, and
after a period of time, re-enters the system. Each time a
customer enters the system it is equally likely to require
information of class A, B, or C. Therefore, under normal
operating conditions, the routing probabilities $p_{AB}$,
$p_{BC}$, and $p_{CA}$, where $p_{AB} + p_{BC} + p_{CA} = 1$,
are given the same values.

The model is built with the premise that event lifetime
distributions have been established for all the processes
involved. The delay process, or equivalently, query
generation, has an exponential distribution
$\exp(\lambda) = 1 - e^{-\lambda t}$, where $\lambda$ is the
rate and $1/\lambda$ is the mean. The same is true for the
process of service completion ($\exp(\mu)$), the process of
server failure ($\exp(\nu)$), the process of data restoration
($\exp(\gamma)$), and the process of unit overhaul
($\exp(\omega)$), when the entire unit is repaired due to
system failure. All processes are independent. Note that all
rates and therefore means are relative and carry the units
time$^{-1}$ and time, respectively.
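The way exponentially distributed event lifetimes drive a discrete event simulation can be sketched as follows. This is a minimal illustration, not the simulation model used in this study: it tracks only the number of servers that are down, ignores queues and query traffic, assumes a single restoration stage with rate $\gamma$, and resamples the competing event lifetimes at every step (which is valid here only because the exponential distribution is memoryless). The function names and the seed are arbitrary choices.

```python
import random


def simulate_time_to_failure(nu, gamma, rng):
    """Run one replication until two servers are down at the same time."""
    clock, servers_down = 0.0, 0
    while servers_down < 2:
        # Schedule the competing events that are enabled in the current state.
        events = {"failure": rng.expovariate((3 - servers_down) * nu)}
        if servers_down == 1 and gamma > 0:
            events["restoration"] = rng.expovariate(gamma)
        name = min(events, key=events.get)   # the earliest event fires
        clock += events[name]
        servers_down += 1 if name == "failure" else -1
    return clock


def estimate_mttf(nu, gamma, runs, seed):
    """Average the time to system failure over independent replications."""
    rng = random.Random(seed)
    total = sum(simulate_time_to_failure(nu, gamma, rng) for _ in range(runs))
    return total / runs


print(estimate_mttf(nu=0.005, gamma=0.05, runs=20000, seed=1))
```

For this toy chain the exact mean time to failure is $(\gamma + 5\nu)/(6\nu^2) = 500$ time units at the rates shown, which gives a reference value for the estimate. Replacing `rng.expovariate` with another lifetime sampler would require tracking residual lifetimes instead of resampling them.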

7.2.2.  Control  Policies 

To  maximize  the  efficiency  of  the  database  unit  under 
server  failures,  two  supervisory  control  inputs  are  introduced 
based  on  the  state  information  of  the  system.  These  control 
actions  alter  the  transition  rates  of  the  system  when  data  loss 
occurs  in  a  server  for  the  purpose  of  improving  performance. 
The necessary state information is the current state of the
servers. Define server states
$S_{AB}, S_{BC}, S_{CA} \in \{0, 1, 2\}$, where
“2” = both the primary data and the secondary data are lost
in a server, “1” = the primary data have been restored but the
secondary data have not yet been restored, and “0” = the
primary data and secondary data in the server are both intact.
A server is failed, or in the down state, when either class of
data is lost, and up, when both the primary and secondary
data are intact.

Two supervisory control inputs, $u_1$ and $u_2$, govern
restoration and routing, respectively. The control input $u_1$
allows an intact server to halt its current process and restore
lost data in a failed server, and input $u_2$ adjusts the
routing probability of customers based on the state of the
servers. Because of the symmetry of the model, the control
inputs and policies may be sufficiently described by the case
of only one failed server $S_{AB}$, where the remaining two
servers must be intact for the system to be up. The control
inputs may be summarized as follows.

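$$u_1 = \begin{cases} 0, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{serves (no restoration)} \\ 1, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{restores class A data} \\ 1, & S_{AB} = 1,\ S_{CA}\ \text{serves},\ S_{BC}\ \text{restores class B data} \end{cases} \qquad (32)$$

$$u_2 = \begin{cases} 0, & S_{AB} = 2,\ p_{AB} = p_{BC} = p_{CA} = \tfrac{1}{3} \\ 1, & S_{AB} = 2,\ p_{AB}(2, u_1),\ p_{BC}(2, u_1),\ p_{CA}(2, u_1) \\ 1, & S_{AB} = 1,\ p_{AB}(1, u_1),\ p_{BC}(1, u_1),\ p_{CA}(1, u_1) \end{cases} \qquad (33)$$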

Recall the routing probabilities $p_{AB}$, $p_{BC}$, and
$p_{CA}$. Under supervisory control, these probabilities are
dependent not only on the routing control input $u_2$ and the
state of the servers, but also on the restoration control
input $u_1$. Table 7.1 shows three sets of routing
probabilities.


Table  7.1  Examples  of  routing  probabilities 


u1    u2    SAB      pAB          pBC          pCA
0     1     2        0            1/2          1/2
1     0     2 (1)    1/3 (1/3)    1/3 (1/3)    1/3 (1/3)
1     1     2 (1)    0 (1/6)      2/3 (1/6)    1/3 (2/3)

The composition of $u_1$ and $u_2$ gives rise to four different
control policies. The case of $(u_1, u_2) = (0, 0)$ corresponds
to the case of a single point failure, and is therefore not
considered in the performance analysis. The control policies
in the other three cases are named

Policy 1: $(u_1, u_2) = (0, 1)$ when a server is down,

Policy 2: $(u_1, u_2) = (1, 0)$ when a server is down,  (34)

Policy 3: $(u_1, u_2) = (1, 1)$ when a server is down.

Note that policy 2 does not permit routing, whereas policy
1 does not permit restoring. A special consideration with the
case $u_1 = 0$ is that customers who arrived at a server
before it failed are rerouted to the delay elements.

7.2.3.  Analytic  Results 

The above system was first studied in [51], where the
database unit was modeled as a closed Markovian queuing
network. The performance of the system under supervisory
control was evaluated based on the performance measures of
mean time to failure (MTTF) of the system, steady-state
availability, expected response time, and service overhead.

System failure is defined as the loss of a second server
before the restoration of a first failed server is completed.
MTTF is a measure of the life of the system. The MTTF
was found to improve significantly under policies 2 and 3,
apparently because of the introduction of restoration.
Availability, a measure of the percentage of time the system
is available to serve customers (i.e., not in a system failure
state), also improved under these policies.

The expected response time, defined as the length of time
a query spends in the upper portion of the system shown in
Figure 7.2, also benefited from policies 2 and 3 for a
sufficiently high restoration rate γ. However, at low values of
γ, the system profited from not having to devote a majority
of its resources to restoring failed servers but simply
servicing customers with its two intact nodes under policy 1.
Policy 3 showed slightly better performance than policy 2;
this advantage is expected to improve with the addition of
customers to the system.

Overhead is defined as the cost incurred by the system for
self-preservation. It is calculated as the ratio of time invested
in restoring the system to its overall busy time, and does not
include the overhaul process. As the failure rate ν increased,
policies 2 and 3 became expensive, and surpassed the
overhead associated with policy 1.

7.2.4.  Limitations 

The Markov model of the database unit presented in [51]
suffered from many limitations. The complexity of the model
was restricted by the need for a manageable number of states
and by the requirement of exponential event lifetime
distributions. A linear increase in either the number of
customers or the number of servers allowed in the system
causes an exponential growth in the number of states. Any
non-exponential event lifetime distribution destroys the
memoryless property essential to a Markov model, although
many of the processes under consideration are most
adequately described by non-exponential distributions.

A majority of database units require a periodic update to
the information contained within the system to maintain the
currency of the data. As seen in Figure 7.2, [51] omitted the
updating process to avoid the explosion of the size of the
state space due to the additional class of customers and the
non-Poisson nature of the update requests.

Modeling the database unit by means of a simulation tool
enables us to remove the limit on the number of customers,
diversify the event lifetime distributions, and include the
update process.

7.3.  Simulation  Model 

7.3.1.  Discrete  Event  System  Simulation 

Modeling  of  systems  in  which  the  state  variable  changes 
only  at  a  discrete  set  of  points  in  time  is  known  as  discrete 
event  system  (DES)  simulation  [1].  Simulation  implies 
solving  for  the  system  variables  through  numerical  rather 
than  analytic  methods.  Observations  of  the  variables 
collected  throughout  the  history  of  the  model  are  stored  and 
processed  to  evaluate  system  performance  measures.  A 
major  component  in  a  discrete  event  system  simulation  is  the 
future  event  list  which  contains  the  notices  for  all  future 
events  scheduled  to  occur.  For  each  event  that  occurs, 
beginning  with  the  first  event  of  the  simulation,  durations  are 
either  computed  or  drawn  from  a  statistical  distribution,  and 
the  end-event  is  added  to  the  future  event  list.  The  advantage 
of  this  method  is  that  every  time  instance  need  not  be 
evaluated,  allowing  the  simulation  to  omit  time  intervals 
where  the  state  of  the  system  does  not  change. 
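
The role of the future event list can be illustrated with a deliberately small example. The sketch below is not the Arena® model used in this study; it assumes a hypothetical single server that alternates between up and down states with exponential lifetimes, and the function name simulate and its parameters are illustrative only. The clock jumps directly from one scheduled event to the next, so intervals in which the state does not change are never evaluated.

```python
import heapq
import random


def simulate(horizon, failure_rate, restore_rate, seed=0):
    """Toy up/down server driven by a future event list (illustrative only)."""
    rng = random.Random(seed)
    fel = []  # future event list: heap of (event_time, event_name)
    heapq.heappush(fel, (rng.expovariate(failure_rate), "fail"))
    clock, up, up_time, history = 0.0, True, 0.0, []
    while fel:
        time, event = heapq.heappop(fel)  # next imminent event
        if time > horizon:
            break
        if up:
            up_time += time - clock  # accumulate time spent in the up state
        clock = time
        history.append((clock, event))
        if event == "fail":
            up = False
            # Draw the restoration duration and schedule its end-event.
            heapq.heappush(fel, (clock + rng.expovariate(restore_rate), "restore"))
        else:
            up = True
            # Draw the next lifetime and schedule the next failure.
            heapq.heappush(fel, (clock + rng.expovariate(failure_rate), "fail"))
    if up:
        up_time += horizon - clock
    return up_time / horizon, history
```

For example, `simulate(10000.0, 0.1, 1.0)` returns the observed fraction of time the server was up together with the ordered list of processed events.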

The  simulation  package  used  in  this  study  is  Arena® 
Professional  Edition  [35].  Arena®  utilizes  an  object-based 
design  for  graphical  model  development  [1].  Objects  called 
modules  are  used  to  model  system  logic  and  physical 
components  such  as  servers  and  queues.  In  addition,  Arena® 
provides  methods  for  statistical  distributions,  failure 
modeling,  statistics  collection,  and  process  analysis.  Arena® 
allows  any  number  of  independent  replications  to  be  run  for 
a  simulation,  with  the  replication  terminating  upon  a  user 
defined  condition.  System  as  well  as  user  defined  statistics 
are  collected  for  each  replication  and  evaluated  for  the  entire 
simulation. 

In  this  study,  the  effect  of  supervisory  control  is  evaluated 
for  MTTF,  system  availability,  expected  response  time,  and 
overhead  as  in  [51].  MTTF  is  evaluated  using  the  method  of 
independent  replications,  whereas  system  availability, 
expected  response  time,  and  overhead  are  evaluated  using 
the  method  of  regenerative  simulations  [27].  All  calculated 
performance  measures  are  obtained  from  simulation  with 
100  replications  unless  otherwise  noted.  The  Process 
Analyzer  is  a  tool  in  Arena®  that  allows  a  series  of 
simulations  (scenarios)  with  varying  system  parameters 
(controls)  to  be  run  automatically  in  succession  and  displays 
the  chosen  system  outputs  (responses).  This  feature  proved 
extremely  useful  in  the  evaluation  of  multiple  performance 
measures as a function of varying system parameters.
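
The two estimation approaches named above can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure implemented in Arena®: the function names are hypothetical, the confidence half-width uses a normal approximation with z = 1.96, and the regenerative estimate is the standard ratio of accumulated reward to accumulated cycle length over regeneration cycles.

```python
import math
import statistics


def replication_estimate(samples, z=1.96):
    """Mean and normal-approximation half-width from independent replications."""
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


def regenerative_estimate(cycles):
    """Ratio estimator over regeneration cycles.

    cycles is a list of (reward, cycle_length) pairs, e.g. (up time in the
    cycle, total length of the cycle) when estimating availability.
    """
    total_reward = sum(reward for reward, _ in cycles)
    total_length = sum(length for _, length in cycles)
    return total_reward / total_length
```

With times to failure from each replication passed to `replication_estimate`, the result is an MTTF estimate with an approximate 95% half-width; with per-cycle up times and cycle lengths passed to `regenerative_estimate`, the result is an availability estimate.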

7.3.2.  Model  Verification 

The model shown in Figure 7.2 and described in Section
7.2.1 was simulated via Arena® under the supervisory control
policies presented in Section 7.2.2. The results from the
analytic and simulation models were compared for
verification purposes. The MTTF and availability agreed
between the two modeling methods, as did the system
overhead. However, the expected response time calculated
from simulation was significantly lower because the
simulation calculation is not a steady-state measure.
Response time statistics can be collected only for customers
that enter and exit the system. Customers that are trapped
outside a failed system do not contribute to the calculated
response time. Therefore, only customers who are in a server
when the system fails suffer the delay of the overhaul
process. The limited number of customers makes this delay
insignificant, resulting in lower response times for the
simulation model. This deficiency no longer exists when an
open queuing system is introduced in the next section.

With  a  simulation  model  constructed  and  verified  in 
Arena®,  the  limitations  on  the  number  of  customers,  the 
event  lifetime  distributions,  and  the  update  process  discussed 
previously,  may  now  be  removed,  as  presented  in  the 
following  section. 

7.4.  Analysis  via  Simulation 

7.4.1.  Open  queuing  network 

A  fixed  number  of  customers  severely  limits  our  ability  to 
fully  observe  the  behavior  of  the  system,  but  it  is  necessary 
to  model  the  database  analytically.  Simulation  modeling 
removes  this  restriction,  and  more  realistic  measures  of 
system  performance  are  provided. 

The system is modeled in Arena® as the open queuing
network shown in Figure 7.1, where customers enter the
system with an exponentially distributed inter-arrival time
exp(λ). The customers are removed from the system upon
service completion.

The MTTF and availability of the open queuing network
are statistically indistinguishable from those obtained in the
closed simulation model. The expected response time,
however, does increase significantly now that customers are
freely allowed to enter and accumulate in the system. The
benefits of routing control are apparent, as shown in Figure
7.3. Policy 3 realizes a lower response time with customers
routed strategically by supervisory control input u2.
However, as failed servers are restored at a higher rate, the
advantage of policy 3 decreases. Routing control becomes
less beneficial because queue lengths do not grow as large at
failed servers.

Figure 7.3. Expected response time of open queuing network
versus restoration rate

Overhead  is  less  sensitive  to  the  type  of  queuing  network. 
The  values  shown  in  Figure  7.4  correspond  to  those  obtained 
analytically  in  the  closed  queuing  network.  The  overhead  of 
policy  1  is  unaffected  by  an  increase  in  the  rate  of  failure 
because  the  system  is  never  required  to  restore  itself. 
Restoration is beneficial at low failure rates; beyond some
threshold, however, it becomes expensive to the system.


Figure 7.4. Overhead of open queuing network versus failure rate

7.4.2. Generalized Distributions

Simulation of the database unit permits the removal of the
limitation to exponentially distributed event lifetimes. The
exponential distribution has a constant failure rate and is
therefore unfit for many event lifetimes. For example, a
component whose failure process is described by an
exponential distribution is probabilistically always as good
as new, regardless of its age [39], while in reality most
components are more likely to fail as they age. For the
distributions described below, parameters such as the shape
parameter α and the scaling parameter β are chosen to
provide a mean equivalent to that of the exponential
distribution previously used.

The arrival process lifetime remains exponentially
distributed (exp(λ)). Often systems undergo certain "busy"
periods, but for the purposes of this study, customers arrive
at a constant rate. The gamma distribution is often used to
represent the time required to complete a task [23] and is
therefore used to describe the process of service completion
(gamma(1/(αβ) = μ)). A component lifetime is better
described by a distribution that reflects the age of the
component. The Weibull distribution has a rate that varies
with time, and is used to describe the failure process
(Weibull(α, 1/β = ν)). For α > 1, the probability of failure
increases with age. Triangular distributions with mode m are
used for the restoration process (tria(1/m = γ)) and the
overhaul process (tria(1/m = ω)) because these event
lifetimes are relatively deterministic. The time required to
restore a known amount of data should not vary significantly.

With the failure event lifetime now dependent on time, the
MTTF and availability of the system, as shown in Table 7.2,
are expected to decrease. This is true for the policies
involving restoration; however, it is not the case for policy 1,
which is unaffected. For policies allowing restoration, the
MTTF depends on a failed server recovering before another
failure occurs. A time-dependent failure rate causes failures
to occur closer together (assuming the lifetimes begin
concurrently), increasing the likelihood of overlapping
failures and reducing the MTTF. Policy 1 is unaffected
because the failure of a second server always results in a
system failure. The MTTF obtained for policy 1 under each
type of distribution is statistically equal because the mean
values of both distributions are equal.

Table 7.2 Comparison of exponential and generalized distributions

          MTTF                         Availability
Policy    Exponential   Generalized    Exponential   Generalized
1         169           172            .61           .62
2         317           244            .71           .69
3         317           265            .71           .69

Expected response time decreases under the generalized
event lifetime distributions, as shown in Figure 7.5; however,
the values follow the same trend as those shown in Figure
7.3.

Figure 7.5. Expected response time versus restoration rate using
generalized distributions

Overhead is shown in Figure 7.6 for the generalized
distributions. Comparison with Figure 7.4 shows a decrease
relative to that observed with exponential distributions.
Using the Weibull distribution, servers are more likely to fail
as they age, resulting in less time devoted to restoration in
the early stages of their lifetime. As the system ages, more
simultaneous failures are likely to occur. When failures
occur close together, the time a server spends restoring
another server decreases because the overhaul process takes
over to restore the system.

Figure 7.6. Overhead versus failure rate using generalized
distributions
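
The mean-matching idea used for the generalized distributions can be sketched with the Python standard library. This is an illustration under explicit assumptions, not the Arena® configuration of the study: all function names are hypothetical, the Weibull scale is chosen here so that the mean lifetime is exactly 1/ν (which requires the gamma function), and the triangular distribution is assumed symmetric about its mode with a half-width of 50% of the mode.

```python
import math
import random


def gamma_params(shape, mu):
    """Gamma (shape, scale) with mean 1/mu, i.e. 1/(shape*scale) = mu."""
    return shape, 1.0 / (shape * mu)


def weibull_scale_for_mean(shape, nu):
    """Weibull scale giving a mean lifetime of exactly 1/nu (assumption)."""
    return 1.0 / (nu * math.gamma(1.0 + 1.0 / shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull lifetime at age t."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def sample_lifetimes(rng, lam, mu, nu, gamma_rate, shape=2.0):
    """Draw one lifetime for each event type (illustrative parameter choices)."""
    k, theta = gamma_params(shape, mu)
    scale = weibull_scale_for_mean(shape, nu)
    mode = 1.0 / gamma_rate
    return {
        "arrival": rng.expovariate(lam),
        "service": rng.gammavariate(k, theta),
        # random.weibullvariate takes (scale, shape) in that order.
        "failure": rng.weibullvariate(scale, shape),
        # Symmetric triangular about the mode; the +/-50% spread is assumed.
        "restore": rng.triangular(0.5 * mode, 1.5 * mode, mode),
    }
```

For a shape parameter greater than 1, `weibull_hazard` increases with t, which reflects the statement above that the probability of failure increases with age; for a shape parameter of 1 it reduces to the constant rate of the exponential distribution.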

7.4.3.  System  Update  Process 

In  order  to  keep  the  information  stored  in  the  database  unit 
current  and  useful,  the  system  must  be  periodically  updated. 
While  an  overhaul  of  the  system  will  update  the  data,  system 
failure  should  occur  infrequently,  resulting  in  the  need  for  a 
system  update  at  a  steady  interval  in  the  form  of  an  update 
entity. 

Update entities arrive at a deterministic rate, with the
inter-arrival time being the acceptable age of the information
in the system. An update entity is sent to each server and
becomes the first in line in the queue. Both the primary data
and secondary data for each server are contained in the
update entity. It follows that the time to process an update
entity is described by twice the restoration process
distribution. A server is unavailable while processing an
update entity. An update entity arriving at a failed server will
restore that server, a significant benefit to policy 1, as well as
to the remaining policies, under which servers no longer
have to restore a failed server that is processing an update
entity. It is important to note that an update to a server does
not reset the lifetime of the component.

The update interval only partially determines the age of
the data in the unit. The data, on average, will only be as old
as the minimum of the update inter-arrival time and the
MTTF, because a system failure results in the system being
overhauled with current data. The age of the data contained
within the database unit is shown in Figure 7.7 versus the
update interval. At low intervals, the data are as old as the
inter-arrival time. As the interval increases, the age of the
data reaches a maximum of the MTTF of each policy given in
Table 7.2. At these high intervals, the data are being updated
only by the system overhaul process.
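
The relationship just described can be written as a one-line approximation. The sketch below is only a rough envelope, not the simulation result plotted in Figure 7.7; the function name is hypothetical, and it assumes that the relevant MTTF values are those in the generalized-distribution column of Table 7.2.

```python
# MTTF per policy, copied from the "Generalized" column of Table 7.2.
# Choosing that column is an assumption made for this illustration.
MTTF_GENERALIZED = {1: 172, 2: 244, 3: 265}


def approximate_data_age(update_interval, policy):
    """Rough envelope on average data age: min(update interval, MTTF)."""
    return min(update_interval, MTTF_GENERALIZED[policy])
```

For short update intervals the approximation returns the interval itself, and for long intervals it saturates at the MTTF of the chosen policy.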


Figure  7.7.  Age  of  data  versus  the  update  interval 

The  update  interval  has  a  significant  impact  on  the 
availability  of  the  system,  as  shown  in  Figure  7.8.  As  the 
update  interval  increases,  the  system  is  more  available  to 
answer  queries.  Availability  increases  until  the  update 
interval  is  so  large  it  neither  improves  MTTF  nor  hinders 
processing  queries,  and  it  reaches  the  steady-state  value 
given  in  Table  7.2.  Policy  1  enjoys  a  higher  availability  at 
low  update  intervals  because  failed  servers  are  being  restored 
by  the  update  process,  which,  on  average,  is  2.5  times  faster 
than  the  overhaul  process.  Beyond  an  update  interval  equal 
to  its  MTTF,  policy  1  no  longer  benefits  from  restoration 
provided  by  the  update  process  and  its  availability 
diminishes slightly.

Figure 7.8. Availability versus the update interval

Frequent system updates tax the resources of the database
unit, causing an increase in the expected response time, as
shown in Figure 7.9. As the update interval increases, the
interruption to customer service becomes negligible, and the
expected response time reaches the values shown in
Figure 7.5. Policy 1 experiences a substantial increase in
expected response time at a low update interval because
often only two servers are available to process queries.
The impact on expected response time is more severe when
those servers are interrupted to process update entities.

Figure 7.9. Expected response time versus the update interval

Updating the system is considered time invested in
maintaining the database unit. Therefore, the expression for
overhead θ is modified to

\[
\theta = \frac{\Pr[S_{AB}\ \text{restores or fails or updates} \mid \text{unit is not failed}]}
{\Pr[S_{AB}\ \text{restores or fails or updates or serves} \mid \text{unit is not failed}]} \tag{4}
\]

Overhead improves as the update inter-arrival time
increases, as shown in Figure 7.10. Restoration of failed
servers by an update entity is not assessed as overhead for
policy 1 because the database unit is failed during this time.
Therefore, overhead is significantly lower for policy 1.

Figure 7.10. Overhead versus the update interval

7.5. Section summary

The use of discrete event system simulation allows the
removal of limitations imposed by having to represent a
database unit analytically as a Markov model. A more
practical system may be evaluated that includes an open
queuing network, dynamic event lifetime distribution rates,
and an update process. This section has modeled and
evaluated a database unit representative of a C2 supporting
system under several supervisory control policies. The
effects of restoration (u1) and routing (u2) were assessed
based on measures of fault-tolerance and responsiveness.
While restoration remains more beneficial than routing, the
benefits of routing control are slightly more visible for an
unlimited-sized population as compared to the results
obtained in [51]. The addition of an update process, while
necessary, weighs heavily on the performance of the system.

8. Performance of a Controlled Database Unit
Subject to Decision Errors and Control Delays

8.1.  Problem  description 

A recent effort to install and test monitoring tools and to
increase the level of redundancy in critical subsystems
in air operation centers has provided opportunities for vast
performance improvement in their command and control
supporting systems. Our previous work on a controlled
processing unit [44] has demonstrated that reduced response
time to service requests and shortened periods of system
unavailability, as a result of automated monitoring and
control, can significantly raise the probability of attaining
the desired outcome in an air operation. A more recent study
by Wu, Metzler, and Linderman [51] on a database unit as
shown in Figure 8.1 further revealed the benefits of a
conscientious design of redundant architecture and the
application of supervisory control, which were measured in
terms of the mean time to unit failure, the steady-state
availability, the expected response time, and the service
overhead of the database unit.

To assess the performance in a quantified manner, both
the processing unit [44] and the database unit (Figure 8.1)
[51] were given the interpretation of a queuing network
[24] with specific sets of operating policies and structural
parameters. The control authorities considered included the
ability to restore the first failed server and the ability to
route service requests. In order to obtain an analytic model
of manageable size for scrutinizing the effects of supervisory
control, the queuing network was restricted to the closed
type [24]. In addition, all the event lifetime distributions
were assumed to be exponential. A simulation study was
conducted by Metzler et al. [32] using Arena [23], [53] with
all the above restrictions removed.


Figure 8.1 A partitioned database unit

An underlying assumption of the existing study is that the
state information in the queuing network model of a given
unit is known exactly at any given time. In reality, however,
it is not practical to monitor every state variable. As a result,
the knowledge of a certain set of states is inferred from
the observables. On the other hand, control actions are likely
required at the time of a state transition, such as the
occurrence of a component failure, in which case a process
of diagnosis must take place before a state-based control
action. The time required for diagnosis can be random, and
the outcome of the diagnosis can be uncertain. The
objectives of this section, therefore, are to seek ways to
incorporate the effects of decision errors and control
action delays into the Markov model of a queuing network,
and to use the model to assess the impact of such errors and
delays on the performance of the database unit in Figure 8.1.

The section is organized as follows. Section 8.2
describes the baseline model of the controlled database unit
in Figure 8.1. Section 8.3 discusses our approaches to
modeling the effects of control delays and decision errors.
Section 8.4 presents the results of performance evaluation
parameterized with respect to the amount of control action
delay and the probability of error.

8.2.  Baseline  model  for  a  controlled  database  unit 

The description of the baseline model, i.e., the model that
does not include decision errors and control delays, follows
to a large extent that of Wu, Metzler, and Linderman [51].
The database unit in Figure 8.1 contains three servers in
parallel to answer three classes (A, B, C) of queries for
which relevant information can be found in the partitioned
sets A, B, C of the database, respectively. Server SAB
contains database class A as the primary class and database
class B as the secondary class. Server SBC contains database
class B as the primary class and database class C as the
secondary class. Server SCA contains database class C as the
primary class and database class A as the secondary class.
The failure of a server implies the loss of two classes of data
within the server. A system level failure is declared when
two servers fail, in which case one class of data is completely
lost. The queues preceding servers SAB, SBC, and SCA are
named QAB, QBC, and QCA, respectively. All queues are of
sufficient capacity. Service is provided on a first-come,
first-served (FCFS) basis at each server.

The three delay elements of average delay 1/λ imply that
there are always three customers present in the unit at any
given time. A new query is generated at a delay element
upon the completion of the service to a query at one of the
servers. The delay elements are also intended to reflect
the response time to the querying customers by other
service nodes in the system that are not explicitly modeled.
Any new query is assumed to be equally likely to seek
database class A, B, or C. Therefore, routing probabilities
pAB, pBC, and pCA are assigned the same values.

The use of a queuing network model for the database is
based on its suitability to involve control actions and to
capture their effects on the system performance. The model
is built in this study with the premise that event life
distributions have been established for the process of query
generation (exp(λ) = 1 - e^(-λt)), the process of service
completion (exp(μ)), the process of server failure
(exp(ν)), the process of data restoration (exp(γ)), and the
process of unit overhaul (exp(ω)) when the failed database
unit is repaired. All such processes are independent.
Standard statistical methods that involve data collection,
parameter estimation, and goodness of fit tests exist for
identifying event life distributions. Since all event lives are
assumed to be exponentially distributed, the database unit
can be conveniently modeled as a Markov chain specified by
a state space X, an initial state probability mass function
(pmf) π_x(0), and a set of state transition rates Λ [8], [22]. The
reader uninterested in the details of model building can
advance to the paragraph right above Equation (35).

8.2.1.  State  space  X 

A state name is coded with a 6-digit number indicative of
all queue lengths and server states in the unit. With some
abuse of notation, a valid state representation is given by
x = QAB QBC QCA SAB SBC SCA, where queue lengths QAB, QBC, QCA
∈ {0, 1, 2, 3} with total length L = QAB + QBC + QCA ≤ 3, and
server states SAB, SBC, SCA ∈ {0, 1, 2}. Server state “2” = data
are lost in both the primary and the secondary classes in a
server, “1” = the data in the primary class have been restored
and data in the secondary class have not been restored, and
“0” = data in both primary class and secondary class in a
server are intact. A server is said to be in the down state if it
is either at state “1” or at state “2”. For example, state
110020 indicates that server SAB is up with one customer in
its queue, server SBC is down with both classes of data gone
and one customer in its queue, and server SCA is up and idle.
Note that the queue length includes the customer being
served. There are 540 valid states in the baseline system. The
total number of states is reduced to 141 when all the states of
system level failures are aggregated. A set of alternative state
names are assigned from X = {1, 2, ..., 141} with 000000
mapped to x=1 and the aggregated system failure state
mapped to x=141.
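
The state counts quoted above can be reproduced by direct enumeration. The following sketch is an independent check written for illustration, with hypothetical function names; it assumes, as stated in the text, that a server is down in state 1 or 2 and that a system level failure means at least two servers are down.

```python
from itertools import product


def enumerate_states(max_customers=3):
    """All tuples (QAB, QBC, QCA, SAB, SBC, SCA) with total queue length <= 3."""
    states = []
    for queues in product(range(max_customers + 1), repeat=3):
        if sum(queues) > max_customers:
            continue
        for servers in product((0, 1, 2), repeat=3):
            states.append(queues + servers)
    return states


def is_system_failure(state):
    """A system level failure occurs when two or more servers are down."""
    servers = state[3:]
    return sum(1 for s in servers if s != 0) >= 2


states = enumerate_states()
operational = [x for x in states if not is_system_failure(x)]
# All system level failure states are aggregated into a single state.
aggregated_count = len(operational) + 1
```

Under these assumptions, `len(states)` evaluates to 540 and `aggregated_count` to 141, in agreement with the figures given in the text.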

8.2.2. Initial state pmf {π_x(0), x = 1, 2, ..., 141}

It is assumed that the database unit starts operation from
state x=1, i.e., the initial state probability is given by vector
π(0) = [1 0 ... 0]. When overhaul is considered at the
occurrence of a system level failure, all customers are
flushed out to the delay elements. Once the database unit is
renewed and ready for operation again, it starts at the same
initial state x=1, and a renewal process [22] is formed.

8.2.3. Set of state transition functions p_ij(t)

Events that trigger the transitions and the corresponding
transition rates are given as follows. A newly generated
query enters one of the servers with rate (3 - L)λ/3. A
query is answered at a server with rate μ. A complete data
loss occurs at a server with rate ν. Data in the primary data
class of a server are restored with rate γ_p u1, and data in the
secondary data class of a server are restored with rate γ_s,
where u1 authorizes whether to restore the lost data for the
primary class. Finally, the failed database unit is renewed
with rate ω u3, where u3 decides whether to repair the failed
system.

Let X(t) ∈ X denote the random state variable at time t. The
set of state transition functions

\[
p_{ij}(t) = P[X(t) = j \mid X(0) = i], \quad i, j = 1, 2, \ldots, 141 \tag{35}
\]

for the continuous-time Markov chain can be solved from the
forward Chapman-Kolmogorov equation [23]

\[
\dot{P}(t) = P(t)\,Q(u_1, u_3), \quad P(0) = I, \quad P(t) = [p_{ij}(t)], \tag{36}
\]

where Q(u1, u3) is called an infinitesimal generator or a rate
transition matrix whose (i, j)th entry is given by the rate
associated with the transition from current state i to next state
j in the rate transition table. The state probability mass
function at time t,

\[
\pi(t) = [\pi_1(t)\ \ \pi_2(t)\ \ \cdots\ \ \pi_{141}(t)], \quad t \ge 0, \tag{37}
\]

is computed by

\[
\pi(t) = \pi(0)\,P(t). \tag{38}
\]

At this point a baseline Markov model for the database
unit of Figure 8.1 has been established. Since the transition
rate matrix Q is dependent on control actions, the state
transition functions p_ij(t) are being controlled, and so are the
state probabilities.
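
The computation of the state pmf from a rate transition matrix can be sketched in pure Python. The example below is not the 141-state model; it uses uniformization, a standard numerical technique chosen here for illustration, to evaluate P(t) for a small generator, and then forms π(t) = π(0)P(t). The function names and the truncation at 200 terms are assumptions of this sketch, and the truncation is adequate only for moderate values of (largest exit rate) × t.

```python
import math


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transition_matrix(q, t, terms=200):
    """P(t) = exp(Q t) via uniformization: sum_k Poisson(k; rate*t) * U^k."""
    n = len(q)
    rate = max(-q[i][i] for i in range(n)) or 1.0
    # U = I + Q / rate is a stochastic matrix.
    u = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    weight = math.exp(-rate * t)  # Poisson weight for k = 0
    p = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        power = mat_mul(power, u)
        weight *= rate * t / k
        for i in range(n):
            for j in range(n):
                p[i][j] += weight * power[i][j]
    return p


def state_pmf(pi0, q, t):
    """State probabilities pi(t) = pi(0) P(t)."""
    return mat_mul([pi0], transition_matrix(q, t))[0]
```

As a usage example, a hypothetical two-state chain with failure rate 0.5 and renewal rate 2.0 has generator `[[-0.5, 0.5], [2.0, -2.0]]`; calling `state_pmf([1.0, 0.0], q, 1.0)` gives the probabilities of being up and down at t = 1 when the chain starts in the up state.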

8.2.4.  Restoration  and  overhaul 

Our  ultimate  goal  is  to  eliminate  all  single  point  failures, 
and  to  mitigate  the  effects  of  a  single  server  failure  on  the 
performance  of  the  database  unit.  Our  approach  is  to  base  the 
supervisory  control  actions  on  the  state  information,  which 
effectively  alter  the  transition  rates  when  loss  of  data  occurs 
in  a  single  server. 

Taking into consideration the symmetry of the model, the
control policy is described only for the case of a failed server
SAB. The control policies considered for this study are
summarized as follows.

$$u_1 = \begin{cases} 0, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{serves (no restoration)} \\ 1, & S_{AB} = 2,\ S_{BC}\ \text{serves},\ S_{CA}\ \text{restores class A data} \end{cases}$$

The presence of supervisory control in the transition rate
matrix is seen via u1, u3, 1-u1, and 1-u3. The values of u1 and u3
represent specific control actions associated with data
restoration and unit overhaul, respectively. Unit overhaul
occurs only at the unit failure state 141.

The  complete  baseline  model  is  provided  in  [51]  in  the 
form  of  a  rate  transition  table,  where  an  additional  control 
variable  u2  was  present.  u2  controls  routing  probabilities 
when  data  loss  occurs  in  a  server.  u2  is  removed  in  this 
section  because  the  small  number  of  queries  in  the  system 
makes  the  additional  benefit  afforded  by  routing  control  less 
obvious  to  observe. 

8.3.  Model  augmentation  to  include  errors  &  delays 

This  subsection  focuses  on  modeling  the  effects  of 
decision  errors  and  control  action  delays  upon  entering  a 
state.  These  two  undesirable  effects  can  be  intertwined.  To 
quantify  their  individual  impact  on  performance,  they  are 
separated  into  the  class  of  decision  errors  when  a  control 
action  is  taken  incorrectly  but  immediately  upon  entering  a 
state,  and  the  class  of  delayed  control  actions  when  a  correct 
control action is taken but after some time delay. In addition,
there are deterministically diagnosable systems for which the
only cost of diagnosis is time [8]. Two augmented models
will be generated in this subsection representing a controlled
database unit with decision error, and one with control action
delay, respectively. Each model will contain 201 states.

8.3.1.  Effect  of  decision  error 

The  supervisory  control  considered  in  this  study  is  state 
information-based.  Upon  entering  a  state,  say,  A,  any 
information  deficiency  can  result  in  uncertainty  in  decision 
making  as  to  whether  to  take  a  control  action  or  what  control 
actions  to  take.  In  this  case,  every  decision  carries  a  risk. 

An example of a decision error with the database unit
would be that, upon a server failure, the wrong server is
identified as having failed. More specifically, suppose SAB
has failed, but SCA is mistakenly thought to
be the failed one. Based on the false information, the control
action would be for SBC to restore data class C in SCA,
whereas SAB would be expected to continue to work. As a
consequence of the wrong decision, none of the servers can
process queries for a period of time. The database unit is said
to have entered an intermittent error state. It is assumed that
from this state, only transitions to more server failures, or to
the recovery to the original destination state, can occur. Figure
8.2 depicts a generalized representation of such a case.

Without loss of generality, let A be a state that is entered
upon a total data loss in a server. Let C be the state entered
upon the completion of primary database restoration
associated with the data loss. Let $B_1$ through $B_n$ be the states
representing completions of services at the other n servers. Let
$G_1, \ldots, G_l$ be the states entered upon the arrival of a new
query in one of the server queues. Let $F_1$ through $F_m$ be the
states entered upon data loss at the other m servers. The notion
of an intermittent state I is introduced, as shown in Figure 8.2,
to allow the representation of imperfect decision making
upon entering A. Therefore, there is an intermittent error
state for each state that involves outgoing transitions with
weakened control authorities due to some decision errors. In
the database unit of Figure 8.1, altogether 60 states are added
to the original 141-state baseline model. Note that the states $G_i$
are not shown explicitly in Figure 8.2, and they can be
regarded as part of the states $F_i$ from this point on. It is assumed that
once the primary database restoration takes place for a
particular server, the secondary restoration is error free.

Figure 8.2 Decision error modeling with an intermittent error state


Let $\lambda_{AC}$ denote the rate of the transition from state A to state C
in the absence of decision error, i.e., the rate of restoration of the primary
database associated with the most recent data loss. Let u be
the probability of successful restoration given that the event
of restoration occurs. The factor (1-u) is then referred to as the
thinning [8] of the Poisson arrival process associated with
the restoration. The split of rate $\lambda_{AC}$ into rate $u\lambda_{AC}$ and
rate $(1-u)\lambda_{AC}$ is sometimes also called a decomposition
[22] of a Poisson arrival process into type 1 with probability
u and type 2 with probability (1-u).

An  imperfect  decision  corresponds  to  the  value  of  u  being 
less  than  unity.  As  a  consequence,  the  authority  of 
supervisory  control  that  is  supposed  to  reinforce  the 
restoration  process  has  been  weakened.  The  smaller  the 
value  of  u ,  the  weaker  the  control  authority  is. 

The rate of recovery from decision error is denoted by $r_c$.
To state the fact that recovery from an intermittent error state
to restoration cannot be faster than the error-free (u = 1)
restoration process, $r_c \le \lambda_{AC}$ is enforced. On the other
hand, the outgoing transition rates from the intermittent error
state to the states of data loss in other servers, i.e., from I to
$F_i$, i = 1, 2, ..., m, are bounded below by the corresponding
rates going from A to $F_i$. These transitions further reduce the
likelihood of reaching state C.

It is now shown that decision errors always degrade the
performance in terms of the state transition probability $P_{AC}$,
which is the probability that restoration to state C occurs
given that the state is A. This probability is
readily obtained for a Markov chain [8]:

$$P_{AC} = \frac{u\lambda_{AC}}{\Lambda(A)}, \tag{40}$$

where

$$\Lambda(A) = \lambda_{AB_1} + \cdots + \lambda_{AB_n} + \lambda_{AF_1} + \cdots + \lambda_{AF_m} + \lambda_{AC} \tag{41}$$

without decision error, in which case u = 1 in (40), and

$$\Lambda(A) = \lambda_{AB_1} + \cdots + \lambda_{AB_n} + \lambda_{AF_1} + \cdots + \lambda_{AF_m} + u\lambda_{AC} + (1-u)\lambda_{AC} \tag{42}$$

with decision error, in which case u < 1. The denominators
given by (41) and (42) are the same. Evidently, (40) is
proportional to u, and is largest at u = 1, when there is no
decision error. On the other hand, flow balance at state I
yields

$$\dot{\pi}_I = (1-u)\lambda_{AC}\,\pi_A - \Big(\sum_{i=1}^{m} \lambda_{IF_i} + r_c\Big)\pi_I, \tag{43}$$

from which the following expression for $\pi_I(t)$ in terms
of $\pi_A(t)$ at steady state is obtained:

$$\pi_I(\infty) = \frac{(1-u)\lambda_{AC}}{\sum_{i=1}^{m} \lambda_{IF_i} + r_c}\,\pi_A(\infty). \tag{44}$$

The right-hand side of (44) is proportional to 1-u.
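
To make the two proportionality statements concrete, the short sketch below evaluates the restoration probability of (40) and the steady-state ratio between the intermittent error state and state A implied by the flow balance (43). All numerical rates are illustrative assumptions rather than values from this study, and the helper names are hypothetical.

```python
def restoration_probability(u, lam_AC, other_rates):
    """P_AC of (40): probability that the next transition out of A
    is the successful restoration to C.
    other_rates: rates from A to the B and F states (assumed values)."""
    total = sum(other_rates) + lam_AC      # Lambda(A), identical in (41) and (42)
    return u * lam_AC / total

def error_state_ratio(u, lam_AC, lam_IF, r_c):
    """pi_I / pi_A at steady state, from setting the derivative in (43) to zero.
    lam_IF: rates from I to the F states; r_c: recovery rate from I."""
    return (1.0 - u) * lam_AC / (sum(lam_IF) + r_c)

# Illustrative numbers only: restoration rate 0.05, two competing failure rates.
for u in (1.0, 0.9, 0.5):
    p = restoration_probability(u, 0.05, [0.005, 0.005])
    r = error_state_ratio(u, 0.05, [0.005, 0.005], r_c=0.05)
    print(u, round(p, 4), round(r, 4))
```

Halving u halves the restoration probability, while the relative occupancy of the intermittent error state grows linearly in 1-u.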

Some results of numerical calculation will be presented in
Subsection 8.4 based on the state-augmented model of the
database unit of Figure 8.1 that show how certain
performance measures depend on the probability of the
restoration decision error.

8.3.2.  Effect  of  delayed  control  actions 

Time  required  for  diagnosis  can  be  regarded  as  the 
universal  cause  of  a  control  action  delay.  Time  delay  can  be 
traded  off  in  some  applications  with  the  decision  error  to 
minimize  their  combined  effects.  This  subsection  focuses  on 
the  discussion  of  the  effect  of  time  delay  alone. 

An  example  on  the  control  action  delay  with  the  database 
unit  of  Figure  8.1  would  be  that  a  total  loss  of  data  on  a 
server  is  not  immediately  observed.  As  a  result,  the  action  of 
data  restoration  is  delayed. 

As in the previous subsection, let A be a state that is
entered upon a total loss of data in a server. Let C be the
state entered upon the completion of primary database
restoration associated with the data loss. States $B_1$ through
$B_n$ and states $F_1$ through $F_m$ also follow the earlier
definitions. Figure 8.3 depicts a proposed model capable of
describing a restoration action delayed by an exponentially
distributed random amount with average $\theta^{-1}$ upon entering
state A.

In a more general case, there can be an N-phased delay
implemented in the augmented model by inserting N states
$D_1$ through $D_N$ in series between states A and C. Each state
$D_i$ retains outgoing transitions to all of $B_1$ through $B_n$ and $F_1$
through $F_m$, in addition to the transition to $D_{i+1}$. The total
amount of delay before the restoration action is bounded below
by the random variable $D = D_1 + \cdots + D_N$, which has a generalized
Erlang distribution [22] with Laplace transform

$$\prod_{i=1}^{N} \frac{\theta_i}{s + \theta_i}. \tag{45}$$

One may use an N-stage Erlang to approach a constant delay,
or an N-stage hyper-exponential to approach a highly
uncertain delay, or a mixture of the two to acquire more
general properties [8].
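
The remark that a multi-stage Erlang delay approaches a constant delay can be checked with the first two moments of a sum of independent exponential stages. The sketch below is an illustration under assumed stage rates; it holds the mean delay fixed and shows the squared coefficient of variation shrinking as the number of stages grows. The function names are hypothetical.

```python
def delay_moments(stage_rates):
    """Mean and variance of a delay made of independent exponential
    stages in series, one stage per entry of stage_rates."""
    mean = sum(1.0 / r for r in stage_rates)
    var = sum(1.0 / (r * r) for r in stage_rates)
    return mean, var

def squared_cv(stage_rates):
    """Squared coefficient of variation: 1 for a single exponential stage,
    tending to 0 as the delay becomes nearly deterministic."""
    mean, var = delay_moments(stage_rates)
    return var / (mean * mean)

# Keep the mean delay at 10 time units and increase the number of stages.
for n_stages in (1, 2, 10, 100):
    rates = [n_stages / 10.0] * n_stages   # equal stage rates (assumed)
    print(n_stages, delay_moments(rates)[0], squared_cv(rates))
```

With equal stage rates the squared coefficient of variation equals one over the number of stages, so more stages give a delay that is closer to a fixed value with the same mean.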
Figure 8.3 Control delay modeling with a single-stage delay state


Note  that  there  are  two  significant  differences  between  the 
decision  error  model  of  Figure  8.2  and  the  control  delay 
model  of  Figure  8.3.  First,  the  link  to  restoration  of  primary 
database  is  present  in  Figure  8.2  with  a  smaller  likelihood  of 
transition,  whereas  the  link  to  restoration  without  delay  is 
absent in Figure 8.3. In addition, all links to service
completion are absent in Figure 8.2, but present in Figure
8.3. Therefore, the two cases are of a different nature.

With  a  single-stage  delay  for  each  state  entered  upon  a 
total  loss  of  data  in  a  server,  60  states  are  added  to  the 
baseline  model.  Numerical  results  on  the  effect  of  control 
action  delay  will  be  presented  in  the  next  subsection. 


8.4.  Performance  analysis  and  discussion 

8.4.1.  Time  to  system  failure 

When u3 = 0, the augmented Markov chain model for the
database unit contains one absorbing state x = 201, at which
the chain remains forever once it is entered. This is the state
of system level failure. The remaining 200 states are transient
states. Decompose the state probability vector as

$$\pi(t) = [\,\underbrace{\pi_T(t)}_{1 \times 200} \;\; \underbrace{\pi_a(t)}_{1 \times 1}\,], \tag{46}$$

where vector $\pi_T(t)$ contains the transient state probabilities,
and $\pi_a(t)$ is the absorbing state probability. Decomposing the
rate transition matrix Q and the state transition function
matrix P(t) solved from (36) accordingly yields

$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ 0 & 0 \end{bmatrix}, \quad P(t) = \begin{bmatrix} P_{11}(t) & P_{12}(t) \\ 0 & 1 \end{bmatrix}. \tag{47}$$

From (36), (38), and (46), it can be determined that the
probability density function of time to system failure, or time
to absorption, is given by

$$\dot{\pi}_a(t) = \pi_T(0)\,P_{11}(t)\,Q_{12}, \quad \pi_a(0) = 0, \tag{48}$$

where

$$\pi_T(0) = [1 \;\; 0 \;\; \cdots \;\; 0], \quad P_{11}(t) = e^{Q_{11}t}. \tag{49}$$

In addition, the mean time to failure of the database unit can
be shown to be [8]

$$\mathrm{MTTF} = -\pi_T(0)\,Q_{11}^{-1}\,\mathbf{1}_T, \quad \mathbf{1}_T = [1 \;\; \cdots \;\; 1]^{\top}. \tag{50}$$
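
The mean time to failure in (50) amounts to solving one linear system with the transient block of the generator. The sketch below is a minimal illustration on a hypothetical chain with two transient states (all data intact, one server restoring) and assumed rates; it is not the 201-state model. A plain Gaussian elimination is included so that the example has no external dependencies.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mttf(Q11, pi_T0):
    """MTTF = -pi_T(0) Q11^{-1} 1, computed by solving Q11 m = -1."""
    m = solve(Q11, [-1.0] * len(Q11))      # m[i]: mean time to failure from state i
    return sum(p * x for p, x in zip(pi_T0, m))

def toy_Q11(u, nu=0.005, gamma=0.05):
    """Transient block of a hypothetical 3-state chain (assumed rates).
    u scales the restoration rate, mimicking a thinned restoration."""
    return [[-nu, nu],
            [gamma * u, -(gamma * u + nu)]]

for u in (0.0, 0.5, 1.0):
    print(u, mttf(toy_Q11(u), [1.0, 0.0]))
```

For this toy chain the result can be verified by hand as (γu + 2ν)/ν², so a larger effective restoration rate extends the mean time to failure linearly.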


Figure 8.4 below shows the dependence of the mean time to
failure of the database unit on the probability of correct control
action for data restoration, with the restoration rate γ as a
parameter. The plot indicates that MTTF is sensitive to the
restoration rate, and becomes more sensitive to supervisory
control coverage at a higher restoration rate. The relative
robustness of MTTF with respect to supervisory control
coverage can be attributed to the fact that recovery has taken
the most optimistic path, with $r_c = \lambda_{AC}$, after a decision error
has been made.

Figure 8.4 Unit MTTF versus control coverage

Figure 8.5 Unit MTTF versus control delay

Figure 8.5 above shows the dependence of the mean time to
failure of the database unit on the expected control action delay
for data restoration, with the restoration rate γ as a parameter. It
is expected that control action delay affects MTTF more
drastically when the restoration rate is high. Control action delay
becomes dominant in how long it takes to restore data when
it becomes comparable to the average time required to perform
data restoration.

8.4.2. Steady-state availability

Suppose that as soon as the database unit reaches a system
level failure, an overhaul process starts with all the
customers flushed out to the delay elements. Suppose the
unit is repaired with rate ω. At the completion of the repair to
condition π(0), the unit immediately starts to operate again.
In this case u3 is set to 1 in the model, whereas it is set to 0 in
the case of an absorbing chain. The existence of a unique
steady-state distribution of the Markov chain when u3 = 1 is
guaranteed if the chain is irreducible (or ergodic) [22]. The
steady-state availability, which can be roughly thought of as
the fraction of time the database unit is up, is given by

$$A_{sys} = 1 - \pi_{201}(\infty), \tag{51}$$

where $\pi_{201}(\infty)$ is determined by solving

$$\pi(\infty)\,Q = 0 \quad \text{and} \quad \sum_{i=1}^{201} \pi_i(\infty) = 1. \tag{52}$$

Figure 8.6 Steady-state availability versus control coverage

Plot parameters: λ = 6, μ = 12, ν = 0.005, ω = 0.01

Figure 8.7 Steady-state availability versus control delay
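
The steady-state computation behind the availability figures can be sketched as follows: the balance equations are solved together with the normalization condition, and the availability is one minus the steady-state probability of the unit failure state. The three-state generator and its rates are assumptions made for illustration, and the helper names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for an irreducible chain."""
    n = len(Q)
    # Row j of A is the balance equation sum_i pi_i * Q[i][j] = 0.
    A = [[Q[i][j] for i in range(n)] for j in range(n)]
    A[-1] = [1.0] * n                 # one balance equation is redundant
    b = [0.0] * (n - 1) + [1.0]
    return solve(A, b)

def availability(Q, failed_state):
    """Fraction of time the unit is up: 1 - pi_failed at steady state."""
    return 1.0 - steady_state(Q)[failed_state]

# Hypothetical toy chain with overhaul enabled (u3 = 1); rates are assumed.
nu, gamma, omega = 0.005, 0.05, 0.01
Q = [[-nu, nu, 0.0],
     [gamma, -(gamma + nu), nu],
     [omega, 0.0, -omega]]
print(availability(Q, failed_state=2))
```

Replacing one redundant balance equation by the normalization condition is a common way to obtain a nonsingular system for the stationary distribution.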

Figure  8.6  and  Figure  8.7  show  the  steady-state 
availability  as  a  function  of  supervisory  control  coverage  and 
a  function  of  expected  control  action  delay.  It  can  be  seen 
that  both  long  delays  and  slow  restoration  reduce  the 
availability  to  unacceptable  levels.  Explanations  on  the 
insensitivity  of  the  availability  with  respect  to  coverage  and 
delay  under  slow  restoration  conditions  follow  those  for 
Figure  8.4  and  Figure  8.5. 

8.4.3.  Response  time 

Consider again the irreducible chain studied in the previous
subsection. Let $I_{ij}$ be the indicator function associated with the
transition from state i to state j, and let $q_{ij}$ be the corresponding
entry in the transition rate matrix Q. Let $N_i$ be the total number
of queries in queue at state i. Then the total expected number
of queries in queue at steady state is given by

$$E[X] = \sum_{i=1}^{201} N_i\,\pi_i(\infty), \tag{53}$$

and the arrival rate at steady state is

$$\lambda_s = \sum_{i=1}^{201} \pi_i(\infty) \sum_{j=1}^{201} I_{ij}\,q_{ij}. \tag{54}$$

The calculation of the response time at steady state then
follows Little's Theorem, $E[X] = \lambda_s E[R]$.
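
The response time calculation via Little's Theorem can be sketched in a few lines. The example assumes that the steady-state probabilities are already available and that, for each state, the total rate of arrival transitions out of that state has been collected into one number; the four-state data below are made up for illustration and are not results of this study.

```python
def mean_response_time(pi, n_queries, arrival_rates):
    """E[R] = E[X] / lambda_s (Little's Theorem).
    pi[i]: steady-state probability of state i
    n_queries[i]: number of queries counted in state i
    arrival_rates[i]: total rate of arrival transitions out of state i,
                      i.e. the inner sum over j in the arrival rate formula."""
    mean_queries = sum(p * n for p, n in zip(pi, n_queries))        # E[X]
    arrival_rate = sum(p * a for p, a in zip(pi, arrival_rates))    # lambda_s
    return mean_queries / arrival_rate

# Made-up example: states indexed by the number of queries L = 0..3, with
# arrival rate (3 - L) * lam out of each state and lam = 6 (assumed).
lam = 6.0
pi = [0.4, 0.3, 0.2, 0.1]
n_queries = [0, 1, 2, 3]
arrival_rates = [(3 - L) * lam for L in n_queries]
print(mean_response_time(pi, n_queries, arrival_rates))
```

Because the arrival rate depends on the state, the effective arrival rate must itself be averaged over the steady-state distribution before Little's Theorem is applied.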

Figure 8.8 and Figure 8.9 show the average response time
as a function of supervisory control coverage and a function
of control action delay, respectively. Unlike the other
performance measures, the sensitivity of the average
response time remains relatively significant at a low
restoration rate.

Figure 8.8 Average query response time versus control coverage

Figure 8.11 Service overhead versus control delay

Figure 8.9 Average query response time versus supervisory control delay

8.4.4.  Overhead 

Overhead is a quantity introduced to reflect the ratio of the
time invested in helping the database unit to survive longer
to its overall busy time. It is a measure of the cost of
supervisory control. More specifically,

$$O = \frac{P[S_{AB}\ \text{restores or fails} \mid \text{unit is not failed}]}{P[S_{AB}\ \text{restores or fails or serves} \mid \text{unit is not failed}]}.$$

Overhead O is calculated for the irreducible chain (u3 = 1)
as a function of supervisory control coverage and a function
of supervisory control delay. These are shown in Figure 8.10
and Figure 8.11. As in the case of availability, overhead at
steady state becomes unacceptably high at a low
restoration rate. It is also sensitive to control coverage and
delay when the restoration rate is high.

Figure 8.10 Service overhead versus control coverage

9. Performance Analysis of a Single Queue Database Unit Subject to Control Delays and Decision Errors

9.1.  Problem  description 

Previous work by Wu, Metzler, and Linderman [51] on
a  redundant  database  unit,  as  shown  in  Figure  9.1, 
demonstrated  the  advantages  of  using  supervisory  control  in 
conjunction  with  the  design  of  a  redundant  architecture. 
Improvements  were  observed  in  the  mean  time  to  system 
failure,  the  steady  state  availability,  the  expected  response 
time,  and  the  service  overhead  of  the  database  unit.  On  the 
other  hand,  much  progress  has  been  made  in  diagnosability 
of  discrete  event  systems,  i.e.,  whether  it  is  possible  to  detect 
a  failure  occurrence  after  a  finite  delay,  and  in  active 
acquisition,  i.e.,  resource  allocation  in  terms  of  use  of 
sensors  for  state  estimation  [37].  Recognizing  that  it  is 
impractical  to  assume  perfect  knowledge  of  state  information 
at  all  times,  Wu,  Metzler,  and  Linderman  [50]  instituted  the 
idea  of  incorporating  decision  errors  or  control  delays  into 
modeling  the  database  unit,  assuming  that  probability  of  a 
decision  error  is  known  and  the  length  of  control  action 
delay  is  random  with  a  known  distribution. 

The goal of this section is to analyze the same set of
performance measures for a database unit with the configuration shown in
Figure 9.2, in the presence of decision errors and control
delays. When multiple servers provide service to a single
class of customers, it is known that system time generally
benefits from committing a customer to a server at the last
possible instant [8]. The architecture in Figure 9.2
reflects our intention to capitalize on such possible benefits
in a multiple class and multiple server system, and on
potentially more effective use of redundancy. In addition,
unlike the earlier study [50], where the effect of decision
errors and that of control delays were investigated separately,
this section integrates the effects of decision errors and
control action delays into a single Markov model.


Figure 9.1 Original architecture of a partitioned database unit


Figure  9.2  Alternative  architecture  of  a  partitioned  database  unit 

The organization of this section is similar to that of [50],
from which most background material is also drawn. In
particular, Subsection 9.2 describes the model of the
controlled database unit of the new architecture as seen in
Figure 9.2, also taking control action delays and errors into
consideration. Subsection 9.3 defines the performance
measures and presents the numerical results of system
performance evaluated with respect to the average delay of
control action and the probability of decision error.
Subsection 9.4 draws conclusions.

9.2.  Modeling 
9.2.1.  Baseline  model 

The database unit presented in Figure 9.2 consists of three
servers in parallel. In this configuration, the database
is partitioned into three classes. Each class is given a
designation of A, B, or C. Each of the three servers is meant
to serve one specific data class (primary class) as well as
back up another data class (secondary class). Thus, server SAB
has class A as its primary class, while class B is its secondary
class. Server SBC has class B as its primary class, while class
C is its secondary class. Server SCA has class C as its primary
class, while class A is its secondary class. A server failure
implies the loss of both the primary and secondary data
classes. If server SAB fails, server SCA's secondary class (A)
can be used to restore server SAB's primary class (A). Once
this is completed, server SBC's primary class (B) restores
server SAB's secondary class (B). There are instances where a
secondary is allowed to serve queries, which will be outlined
shortly. A failed server always has its primary data class
restored before its secondary data class. The entire database
unit fails when two server failures occur, resulting in the loss of
an entire class of data. All servers are fed by a common
queue. This queue is of sufficient capacity to contain all
queries. The service hierarchy for queries located in the
queue depends upon how the control variables are defined.

Once a query leaves a server it vanishes into a delay
element, where after an average time delay of 1/λ units a new
query of an independent class emerges at the end of the
queue. The delays are meant to represent the response time
of other elements in the system outside of the database that
are not explicitly modeled [45]. The use of three delay
elements indicates that there will be no more than three
queries in the unit at any given moment. Once leaving a
delay element, a query has an equal probability of requesting
class A, B, or C data.

This system is viewed as a queuing network, where the event
life distributions are determined by the processes of query
generation (exp(λ), with distribution function $1 - e^{-\lambda t}$),
service completion (exp(μ)), server failure (exp(ν)), data class
restoration (exp(γ)), and system overhaul (exp(ω)), which
occurs when the unit fails. All of these processes are
independent. Since all event lives
are assumed to be exponentially distributed, a Markov chain
results. This chain has a state space $\mathcal{X}$, an initial state
probability mass function (pmf) $\pi_x(0)$, and a set of state
transition rates Λ [8].

(a) State space $\mathcal{X}$

A state name is created as an alphanumeric string of 6
characters to indicate the status of the queue as well as the
status of the servers in the system. States are represented as
$x = Q_3 Q_2 Q_1 S_{AB} S_{BC} S_{CA}$, where $Q_i \in \{0, A, B, C\}$ represents
whether the ith queue location, i = 1, 2, 3, is unoccupied, or
occupied with a query of class A, B, or C; and
$S_{\alpha\beta} \in \{I, W, R, P, S, F\}$, $\alpha\beta$ = AB, BC, CA, represents the
state of a server. In particular, "I" = server is idle, "W" =
server is answering a query, "R" = server is restoring another
server, "P" = primary class of server lost (which implies a
concurrent loss of the secondary class), "S" = primary class
of server restored but secondary class of server not yet
restored, and "F" = failure of database unit. The unit fails when
two servers lose class data concurrently. As an example,
state 0AAWII indicates that there are 3 queries in the system:
a query is receiving service in server SAB, a query of class A
is waiting in queue position 1, and a query of class A is waiting
in queue position 2. When examining the system using the
$x = Q_3 Q_2 Q_1 S_{AB} S_{BC} S_{CA}$ convention, there are 195 valid states.
These states can then be mapped to the set $\mathcal{X} = \{1, 2, 3, \ldots, 195\}$,
where x = 000III is mapped to state 1, etc.
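
The state naming convention lends itself to a small parser. The sketch below decodes a six-character state name into its queue and server parts and counts the queries in the system. It assumes that a server holds a query only when its status is "W"; the function names are hypothetical, and the sketch does not attempt to enumerate which of the possible strings are among the 195 valid states.

```python
QUEUE_SYMBOLS = {"0", "A", "B", "C"}
SERVER_SYMBOLS = {"I", "W", "R", "P", "S", "F"}
SERVERS = ("SAB", "SBC", "SCA")

def parse_state(name):
    """Split a state name Q3 Q2 Q1 SAB SBC SCA into queue and server status."""
    if len(name) != 6:
        raise ValueError("a state name has exactly 6 characters")
    q3, q2, q1 = name[0], name[1], name[2]
    servers = name[3:]
    if not {q1, q2, q3} <= QUEUE_SYMBOLS:
        raise ValueError("invalid queue symbol in " + name)
    if not set(servers) <= SERVER_SYMBOLS:
        raise ValueError("invalid server symbol in " + name)
    return {
        "queue": [q1, q2, q3],                  # queue positions 1, 2, 3
        "servers": dict(zip(SERVERS, servers)),
    }

def queries_in_system(name):
    """Waiting queries plus queries in service (assumed: status 'W' only)."""
    state = parse_state(name)
    waiting = sum(1 for q in state["queue"] if q != "0")
    in_service = sum(1 for s in state["servers"].values() if s == "W")
    return waiting + in_service

print(parse_state("0AAWII"))
print(queries_in_system("0AAWII"))
```

For the example state 0AAWII the count is three, one query in service at SAB and two class A queries waiting, in agreement with the description above.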

(b) Initial state pmf $\{\pi_x(0),\ x = 1, 2, 3, \ldots, 195\}$

It is assumed that the system always begins in state 1.
Therefore the initial state probability is $\pi(0) = [1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0]$.
In the case of a system failure, queries that were being served
are considered 'lost' and sent back to the delay elements,
conforming to a preemptive policy [24]. However, those
queries residing in the queue remain until the system has
been restored. New queries can arrive while the system is not
functioning. Once the system is restored, it returns to the
appropriate state, dictated by the number of queries currently
in queue, and what data class those queries require. As an
example, AABFFF would transition to 00AWWI.

(c) Set of transition rates $p_{ij}(t)$

A transition rate matrix is created for all states within the
system. The rates that are included in this matrix are as
follows. Queries arrive with a rate of (3-L)λ, where L is the
queue length plus the number of queries currently
being served. Queries are served with rate μ. A server fails,
resulting in complete data loss, with rate ν. The primary data
class of a server is restored with rate γp. The secondary data
class of a server is restored with rate γs. After the unit
completely fails, it is repaired with rate ω.

Let $X(t) \in \mathcal{X}$ be the random state variable at time t. The set
of state transition functions

$$p_{ij}(t) = P[X(t) = j \mid X(0) = i], \quad i, j = 1, 2, \ldots, 195 \tag{56}$$

for the continuous-time Markov chain can be solved from the
forward Chapman-Kolmogorov equation [8]

$$\dot{P}(t) = P(t)\,Q, \quad P(0) = I, \quad P(t) = [p_{ij}(t)], \tag{57}$$

where Q is an infinitesimal generator or a rate transition
matrix whose (i,j)th entry is given by the rate associated with
the transition from current state i to next state j. The state
probability mass function (pmf) at a time t is

$$\pi(t) = [\pi_1(t) \;\; \pi_2(t) \;\; \cdots \;\; \pi_{195}(t)] \tag{58}$$

and is calculated through

$$\pi(t) = \pi(0)P(t). \tag{59}$$

This establishes a Markov model for the database unit seen
in Figure 9.2.

9.2.2.  Control  Variables 

The  main  objective  of  supervisory  control  is  to  reduce  the 
adverse  effects  of  server  failure.  To  accomplish  this,  a 
combination  of  two  control  variables  is  instituted. 

The control variable u1 is used to dictate when the
secondary server is allowed to serve queries when a failure
elsewhere has occurred. u1 is equal to 1 if the primary server
for the data class is down, or is restoring
the same data class in another server. Otherwise, u1 is equal
to 0.

The control variable u2 is used to dictate which entity in the
queue takes precedence when the secondary server is
allowed to serve. u2 is equal to 1 when the first query in the queue
takes precedence at an available server. Otherwise u2 is equal
to 0, in which case the primary query takes precedence over
the secondary query regardless of their order in the queue. As
an example, suppose server SAB is given permission to serve both
primary and secondary data classes. If the queue is ordered
such that B precedes A (e.g. [0AB]) and u2 = 0, then A will be served
before B is served. When u2 = 1, query B would be served
first.
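
The precedence rule governed by u1 and u2 can be expressed as a small selection function. This is one possible reading of the rule, written as a sketch: the queue is given head first, a server may take its secondary class only when u1 = 1, and with u2 = 0 a waiting query of the primary class is taken ahead of any secondary query. The function name `next_query` and the list representation of the queue are assumptions made for illustration.

```python
def next_query(queue, primary, secondary, u1, u2):
    """Return the class of the query an available server takes next,
    or None if no waiting query may be served by it.
    queue: waiting query classes, head of the queue first."""
    allowed = {primary}
    if u1 == 1:
        allowed.add(secondary)            # secondary service is authorized
    eligible = [q for q in queue if q in allowed]
    if not eligible:
        return None
    if u2 == 1:
        return eligible[0]                # first eligible query in queue order
    # u2 == 0: the primary class takes precedence regardless of position.
    return primary if primary in eligible else eligible[0]

# Queue [0AB]: class B is at the head, class A is behind it.
queue = ["B", "A"]
print(next_query(queue, primary="A", secondary="B", u1=1, u2=0))
print(next_query(queue, primary="A", secondary="B", u1=1, u2=1))
```

With server SAB and the queue of the example, the function returns class A when u2 = 0 and class B when u2 = 1.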

The control variable u3 is used to dictate whether the
database unit restores itself after failure. u3 is equal to 1
when unit repair is allowed. Otherwise it is 0.

9.2.3.  Model  augmentation  to  include  delays  with  errors 

Control action delays can occur at many state transitions.
This study examines the case of a delay associated with
server failure. In such a case, the action of server restoration
is delayed.

The supervisory control of this system is state information
based. Thus, when a new state is entered that requires a
control action, information deficiency can lead to an
incorrect control action.


Figure  9.3  Control  delay  with  decision  error  modeling 

In Figure 9.3, let A be a state that is reached due to a
server failure. Let P be the state that is reached when the
primary data class of the server is restored. Let
$B_1, \ldots, B_m$ be the states entered upon service completion at
operating servers. Let $G_1, \ldots, G_n$ be the states entered upon
the arrival of a new query. Let $F_1, \ldots, F_k$ be the states entered
upon the failure of another server. Let D be the state entered
when the server failure is correctly diagnosed. Let I represent the
state entered if a wrong decision is made in transitioning
out of state A.

In adding a delay and control error, 145 new states are
created. It is assumed that there is no error or delay
associated with restoring the secondary server after the
primary has been restored. Since the system cannot
immediately recognize server failure, in the period between
physical failure and diagnosis, any queries entering the
system that are allowed to be served by the failed server are
sent to it with the belief that it is functioning. As a result the
query is kept waiting until the server failure is diagnosed.
Only upon correct diagnosis will queries be prevented from
entering the server. Let δ be the rate of the exponentially
distributed control action delay. Figure 9.3 shows that a delay
can lead to successful diagnosis of a failed server, resulting
in the correct restoration of the failed unit, or an incorrect
diagnosis of server failure, resulting in the attempted
restoration of a functioning unit. The conditional probability
of no decision error given that a server has failed is defined
as u4.

It was shown in [50] that delays in control
actions and decision errors always degrade the
performance in terms of the transition probability to
restoration of a failed server.


probability  of  no  decision  error,  and  the  average  delay. 
There  was  no  clear  advantage  gained  in  the  implementation 
of  control  variables  u4  and  u2.  In  this  analysis,  both  values 
are  set  to  one. 

9.3.1.  Mean  time  to  system  failure 

By  setting  u3=0 ,  recovery  to  the  initial  state  will  not  occur, 
and  the  system  cannot  leave  any  of  the  40  failure 
(absorbing)  states  once  it  is  entered.  The  other  300  states 
are  transient  states.  The  state  probability  vector  can  be 
represented  as 

n(f)  =  [7T^)  7zjy\,  (60) 

1x300  1x40 

where  vector  tzx  contains  the  transient  state  probabilities 
and  vector  na  contains  the  absorbing  state  probabilities. 
The  rate  transition  matrix  Q  and  the  state  transition 
function  matrix  P(t)  can  be  decomposed  into 
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From  (57),  (59),  and  (61),  it  can  be  found  that  the 
probability  density  function  of  time  to  system  failure,  or 
time  to  absorption,  is 

Xa  (0  =  (0)P,  I  (0012 ,  na  (0)  =  0,  (62) 


where 


M0)  =  [l  0  0\Pn(t)  =  eQ"‘.  (63) 

Thus,  the  mean  time  to  failure  can  be  determined  through 
the  equation  [24] 


MTTF  =  -nT  (COgf/ /T  >  It  = 


(64) 


X=6,  [i=12,  v=0.005  <9=0.01  Yp=0.05  ys=0.05 


Figure  9.4  MTTF  versus  u4 


9.3.  Performance  analysis 

In  this  section,  performance  measures  in  terms  of  mean  time 
to  system  failure,  steady-state  system  availability,  expected 
query  response  time,  and  expected  service  overhead  are 
reintroduced  following  the  presentation  in  [50].  The 
performance  is  then  evaluated  for  the  model  of  the  database 
system  of  Figure  9.2.  The  goal  is  to  quantify  the  advantage 
of  architecture  in  Figure  9.2  over  that  in  Figure  9.1,  and  to 
understand  the  dependence  of  the  performance  on 


Figure 9.4 shows the mean time to failure as a function of
the conditional probability u4 with the delay rate δ as a
parameter. As the probability of successful error diagnosis
increases, the mean time to failure also increases. The ability
to successfully diagnose errors allows the unit to attend to
failures more rapidly. This reduces the chance that another
server fails in the meantime, which would result in unit
failure. Also, as the rate of delay increases, the mean time to
failure decreases. A smaller delay makes it easier for the
system to enter an intermittent error state. This is particular
to the system setup. The
argument is that when less time is allowed to diagnose a
failure, the likelihood of being able to identify correctly
where the failure has occurred is reduced. As a result, the
gain from fewer decision errors outweighs the loss from a
longer delay in the diagnosis process.
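
The mean time to absorption used in this subsection can be checked on a small example. The following is a minimal sketch, assuming a toy three-state chain (two transient states and one absorbing state) rather than the 340-state model; the rates a, b, and c and the function name are hypothetical. It evaluates the negative product of the initial transient distribution, the inverse of the transient block of Q, and a vector of ones, using a linear solve instead of an explicit inverse.

```python
import numpy as np


def mean_time_to_failure(Q11, pi_T0):
    """Mean time to absorption: -pi_T(0) * inv(Q11) * 1."""
    ones = np.ones(Q11.shape[0])
    # Solve Q11 x = -1 rather than forming the inverse explicitly.
    x = np.linalg.solve(Q11, -ones)
    return float(pi_T0 @ x)


# Toy chain (not the 340-state model):
#   transient state 0 = all servers up,
#   transient state 1 = one server failed,
#   absorbing state   = unit failed.
a, b, c = 0.01, 0.05, 0.005   # hypothetical failure, restoration, second-failure rates
Q11 = np.array([[-a, a],
                [b, -(b + c)]])
pi_T0 = np.array([1.0, 0.0])  # start with all servers up

mttf = mean_time_to_failure(Q11, pi_T0)
print(mttf)
```

For this toy chain the result can be verified by hand as (a + b + c) / (a c), which gives a quick sanity check before applying the same routine to a larger transient block.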

Figure 9.5 further illustrates the effects of delay, with
restoration rate as the parameter. It can be seen that varying
the success rate of failure diagnosis has a more profound
effect on the mean time to failure than the restoration rate.
For small values of δ, intermittent error states are reached
more rapidly than other error states. As the delay time
increases, the amount of time before an intermittent error
state is reached increases as well. Thus, the mean time to
failure increases.


λ=6, μ=12, ν=0.005, ω=0.01, u4=0.75


9.3.2. Steady State Availability

The database unit is allowed to restore itself after it has
failed by setting u3 = 1. The unit is then restored with rate ω.
The Markov chain for the system is irreducible, and
therefore has a unique steady-state distribution [8]. The
system’s availability is the fraction of time when the
system is not in a failure state, or

\[ A_{sys} = 1 - \sum_{i=301}^{340} \pi_i(\infty), \tag{65} \]

where \pi_{301}(\infty) to \pi_{340}(\infty) can be solved through

\[ \pi(\infty) Q = 0, \quad \text{and} \quad \sum_{i=1}^{340} \pi_i(\infty) = 1. \tag{66} \]


Figure  9.6  Steady  state  availability  versus  u4 


Figure 9.6 shows the steady state availability as a function
of u4 with the control delay as a parameter. An increase in
the success of diagnosis results in greater system availability.
However, for the delay rates chosen, the system has an
unacceptably low availability. A faster delay rate
results in increased availability as well. A rapid failure
diagnosis reduces the chance that other servers fail in the
meantime.

Figure  9.7  shows  the  steady  state  availability  as  a  function 
of  control  delay  with  restoration  as  a  parameter.  Restoration 
has  little  effect  on  reducing  the  effect  of  a  prolonged  delay 
on  the  system’s  availability.  An  increase  in  restoration  rate 
only  marginally  increases  the  system’s  availability. 
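
The availability computation in this subsection amounts to solving the balance equations together with the normalization condition and then subtracting the probability mass of the failure states from one. A minimal sketch follows, assuming a toy three-state irreducible chain in place of the 340-state model; the rates and the helper name `steady_state` are hypothetical.

```python
import numpy as np


def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for an irreducible CTMC."""
    n = Q.shape[0]
    A = Q.T.copy()
    # One balance equation is redundant; replace it with normalization.
    A[-1, :] = 1.0
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)


# Toy chain: 0 = all up, 1 = one server failed, 2 = unit failed.
a, b, c, w = 0.01, 0.05, 0.005, 0.01   # hypothetical rates; w restores the unit
Q = np.array([[-a, a, 0.0],
              [b, -(b + c), c],
              [w, 0.0, -w]])

pi = steady_state(Q)
failure_states = [2]
availability = 1.0 - pi[failure_states].sum()
print(pi, availability)
```

Replacing a balance equation with the normalization row is one common way to obtain a nonsingular system; a least-squares solve on the augmented system would be an alternative for larger or poorly conditioned models.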


λ=6, μ=12, ν=0.005, ω=0.01, u1=u2=1, u4=0.75


Figure  9.7  Steady  state  availability  versus  control  delay 

9.3.3.  Response  Time 

The response time of the system is a measure of the amount
of time a query resides in the database unit. It can be found
by solving Little’s Theorem E[X] = λs E[R] [8], where E[R]
is the response time. The equation can be solved using

\[ E[X] = \sum_{i=1}^{340} N_i \, \pi_i(\infty), \tag{67} \]

where N_i is the total number of queries in queue at state i.
The steady state arrival rate is

\[ \lambda_s = \sum_{i=1}^{340} \sum_{j=1}^{340} \pi_i(\infty) \, I_{ij} \, q_{ij}, \tag{68} \]

where I_ij is an indicator function, valued as 0 if there is no
transition from state i to j and 1 if there is a transition from
state i to j, while q_ij is the corresponding value in the
transition rate matrix, Q.

Figure  9.8  shows  the  response  time  as  a  function  of  u4  with 
the  delay  rate  as  a  parameter.  As  the  success  rate  increases, 
the  response  time  of  the  system  decreases.  Queries  pass  more 
rapidly  through  the  system  when  server  errors  are  properly 
diagnosed.  Also,  the  less  delay  queries  experience,  the  more 
rapidly  they  will  pass  through  the  system. 
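
The response time calculation via Little's Theorem can be sketched on a much smaller chain. The example below assumes a single always-working server with a finite buffer of three queries (an M/M/1/3 queue), which is not the database unit model of this section; it also assumes that the indicator in the arrival-rate sum marks only those transitions that correspond to a query arrival. All names and values are illustrative.

```python
import numpy as np


def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for an irreducible CTMC."""
    n = Q.shape[0]
    A = Q.T.copy()
    A[-1, :] = 1.0          # replace one balance equation by normalization
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)


lam, mu, K = 6.0, 12.0, 3   # arrival rate, service rate, buffer size (toy values)
n = K + 1
Q = np.zeros((n, n))
arrival = np.zeros((n, n))  # assumed indicator of arrival transitions
for i in range(n):
    if i < K:
        Q[i, i + 1] = lam
        arrival[i, i + 1] = 1.0
    if i > 0:
        Q[i, i - 1] = mu
    Q[i, i] = -Q[i].sum()   # diagonal makes each row sum to zero

pi = steady_state(Q)
N = np.arange(n)                              # queries present in state i
EX = float(N @ pi)                            # expected number of queries
lam_s = float((pi[:, None] * arrival * Q).sum())  # steady state arrival rate
ER = EX / lam_s                               # Little's Theorem
print(EX, lam_s, ER)
```

In this toy queue the effective arrival rate is lower than the nominal rate because arrivals are blocked when the buffer is full, which is why the arrival rate is computed from the chain rather than taken as a parameter.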
λ=6, μ=12, ν=0.005, ω=0.01, γf=0.1, u1=u2=1


Figure 9.9 shows the response time as a function of
control delay with restoration as a parameter. As the delay
time increases, queries move more slowly through the
system, increasing the response time. Reduced
restoration rates of failed servers also increase the time a
query resides in the system.


λ=6, μ=12, ν=0.005, ω=0.01, γf=0.1, u1=u2=1


Figure 9.9 Response time versus control delay

9.3.4. Overhead

Overhead is the ratio of the time a server invests in
helping the database unit survive longer to its overall busy
time. It can be expressed as

\[ O = \frac{\Pr[\, S_{\alpha\beta} \text{ restores / fails} \mid \text{unit has not failed} \,]}{\Pr[\, S_{\alpha\beta} \text{ restores / fails / serves} \mid \text{unit has not failed} \,]}. \tag{69} \]

Figure  9.10  shows  the  overhead  results  for  the  irreducible 
chain  as  a  function  of  u4  and  uses  the  control  delay  as  a 
parameter.  All  delay  times  display  the  same  trend  in 
overhead  value.  The  more  successful  the  system  is  in 
diagnosing  errors,  the  less  time  servers  will  have  to  dedicate 
to  restoring  other  servers. 

Figure  9.11  shows  overhead  as  a  function  of  control  delay 
with  restoration  as  a  parameter.  An  increase  in  delay  time 
causes  servers  to  dedicate  themselves  to  the  restoration  of 
other  servers  and/or  increases  the  possibility  that  those 
servers  will  themselves  fail. 
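
The overhead ratio defined in this subsection can be evaluated directly from steady-state probabilities once each state is labeled with what a given server is doing. The sketch below is illustrative only: the five states, their probabilities, and the activity labels are hypothetical and do not come from the model.

```python
def overhead(pi, activity):
    """Overhead of one server, conditioned on the unit not having failed.

    pi       : dict mapping state -> steady-state probability
    activity : dict mapping state -> 'idle', 'serving', 'restoring',
               'failed', or 'unit_failed' (labels are assumptions)
    """
    alive = {s: p for s, p in pi.items() if activity[s] != "unit_failed"}
    total = sum(alive.values())
    helping = sum(p for s, p in alive.items()
                  if activity[s] in ("restoring", "failed"))
    busy = sum(p for s, p in alive.items()
               if activity[s] in ("restoring", "failed", "serving"))
    # Both probabilities share the same conditioning event.
    return (helping / total) / (busy / total)


# Hypothetical five-state example.
pi = {0: 0.50, 1: 0.30, 2: 0.10, 3: 0.05, 4: 0.05}
activity = {0: "idle", 1: "serving", 2: "restoring",
            3: "failed", 4: "unit_failed"}

o = overhead(pi, activity)
print(o)
```

Because the numerator and denominator are conditioned on the same event, the conditioning probability cancels; it is kept explicit here only to mirror the definition.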


9.4.  Section  summary 

The study of a single queue, multiple-class architecture
did not show improved performance over the
multiple-queue architecture in the absence of delays and
control errors. This is attributed to the low utilization of the
system that houses only 3 queries, and to the routing control
that partially offsets the disadvantage of the multiple queue
architecture. The small number of queries is purposely
chosen to make analytic modeling more tractable. It is
expected that an increase in the number of queries will reveal
the benefits of the single queue architecture. A simulation
study will be conducted to validate the conjecture.

Analysis of the Markov chain shows that control action
delays and decision errors can cripple the database unit. The
benefits of allowing a server to serve a secondary class were
marginal, and therefore, optimization of control policy
matters little. Simulation studies with higher query traffic
will better show whether control policy matters
more. In general, the ability to minimize decision error was
more important than ensuring rapid restoration.

An interesting future research area would be to
evaluate decision errors and control delays when more
detailed models for the decision and control processes are
established.


λ=6, μ=12, ν=0.005, ω=0.01, γ=0.1, u1=u2=1


Figure  9.11  Overhead  versus  control  delay 
10. Simulation of the Single Queue Database Unit &
Comparison with the Multi-queue Unit

10.1.  Problem  description 

The  above  section  developed  a  single  queue  Markov 
model  of  a  database  unit.  It  compared  that  architecture  with  a 
multiple  queue  database  unit  architecture  devised  by  Wu  and 
Metzler  using  the  performance  measures,  mean  time  to 
failure,  system  availability,  expected  response  time,  and 
system  overhead.  The  single  queue  system  improved  in 
expected  response  time  and  in  system  overhead  as  compared 
with  the  multiple  queue  system.  However,  the  improvement 
observed  was  less  than  expected.  The  above  attempt  failed  to 
consider  the  effect  that  the  varied  arrival  rates  have  on  the 
system.  It  is  expected  that  varying  the  arrival  rate  will  prove 
the  single  queue’s  advantage  over  the  multiple  queue  more 
definitively. 

The  single  queue  model  allows  service  by  the  secondary 
class  server  only  when  the  corresponding  primary  server  is  in 
a  failed  or  restoring  state.  The  Markov  model  is  redefined  to 
allow  the  secondary  class  server  to  serve  at  any  time  when  it 
is  not  otherwise  serving,  failed,  or  restoring.  The  new  model 
contains  153  viable  states.  The  original  model  contained  162 
viable  states.  The  change  in  viable  states  causes  an 
improvement  in  expected  response  time,  system  overhead, 
and  server  utilization. 

This section will numerically analyze the effect of
increasing the arrival rate to 3.33 times the nominal value
for the closed system. It will also compare the multiple
queue system configuration and single queue configuration
through simulation. Numerical analysis will be performed on
the single queue system modified to allow the secondary
server to serve at any time it is not busy, failed, or restoring.
The modified single queue system will then be analyzed
through simulation. All
system parameters are considered to be: λ = 6, γp = γs = 0.05, ν
= 0.005, μ = 12, ω = 0.01, u1 = u2 = u3 = 1 unless otherwise
noted.

10.2.  Original  System  -  Numerical  Analysis 

Figure 10.1 shows the effect of increasing the arrival rate
of the closed single queue system on overhead. The system is
closed, therefore only three queries are allowed in the system
at any given moment. Reducing the amount of time queries
spend in the process delay returns them to the queue faster.
This results in a reduction in the time a server is idle and an
increase in the time that the server is busy. At ν = 0.001,
there is an 8% reduction in overhead. At ν = 0.01 there is a
10% reduction in overhead. As the failure rate increases,
servers fail more often. Therefore other servers spend an
increased amount of time repairing failed servers. This
results in a convergence of the two overhead plots.


Figure  10.1:  Single  queue  system  overhead  versus  failure  rate  with 
varied  arrival  rate 

Figure 10.2 shows the effect of increasing the arrival rate
of the closed single queue system on response time. A
significant increase in response time is observed. At γ = 0,
there is a 1.2 time unit difference in response time. At γ =
0.1, there is a 0.42 time unit difference in response time. The
increase in response time is due to the effect of the
restoration rate relative to service rate. Queries are serviced
with a rate of 12 queries/unit time. By allowing rapid service
followed by a short arrival delay, queries pass rapidly
through the system. However, in the event of a failure, the
restoration rate is varied only between zero and 0.1, two
orders of magnitude lower than the parameters μ = 12 and λ =
20. Therefore, queries that cannot be served must wait in the
queue. When λ = 6, these queries waited for less time in the
queue because they spent more time in the delay element.
When λ = 20, the queries waited for more time in the queue
because they spent less time in the delay element. Also, as
the restoration rate increases, the difference in response time
decreases because servers are restored more quickly and
queries wait in the queue for a shorter amount of time.
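
The closed-loop mechanism described above, in which queries that spend less time in the delay element spend more time in the queue, can be reproduced with a very small model. The sketch below assumes a closed system with three queries, one server that never fails, and an exponential delay element with rate λ per query; it omits failures and restoration entirely, so it illustrates only the trend and not the values reported for Figure 10.2. The function name is hypothetical.

```python
import numpy as np


def closed_system_response_time(lam, mu, n_queries=3):
    """Expected time a query spends at the server (queue plus service)."""
    n = n_queries + 1                 # state k = queries at the server
    Q = np.zeros((n, n))
    for k in range(n):
        if k < n_queries:
            # Each query still in the delay element returns at rate lam.
            Q[k, k + 1] = (n_queries - k) * lam
        if k > 0:
            Q[k, k - 1] = mu
        Q[k, k] = -Q[k].sum()
    A = Q.T.copy()
    A[-1, :] = 1.0                    # normalization replaces one equation
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    pi = np.linalg.solve(A, rhs)
    mean_at_server = float(np.arange(n) @ pi)
    throughput = mu * (1.0 - pi[0])   # server works whenever it is not empty
    return mean_at_server / throughput  # Little's Theorem


r6 = closed_system_response_time(lam=6.0, mu=12.0)
r20 = closed_system_response_time(lam=20.0, mu=12.0)
print(r6, r20)
```

Even without failures, the response time grows when the delay-element rate rises from 6 to 20, because the same three queries return to the queue sooner and find it more crowded.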


Response Time (u1 = 1, u2 = 1) as a function of failure rate


Figure  10.2:  Single  queue  system  response  time  versus  failure 
rate  with  varied  arrival  rate 

10.3.  Original  System  -  Arena  Simulation 

Figure 10.3 compares the response times of the single
queue system and the multiple queue system proposed by
Wu and Metzler as a function of arrival rate. The single
queue system performs better than the multiple queue
system. When the arrival rate is 2, the single queue system
outperforms the multiple queue system by 4.2 time units.
When the arrival rate is 8, the single queue system
outperforms the multiple queue system by 6.18 time units.
When the arrival rate is 16, the single queue system
outperforms the multiple queue system by 14.1 time units.
Small arrival rates indicate
that few queries are entering the system. Thus, the advantage
of the single queue system is less apparent. Greater arrival
rates indicate more queries are entering the system. As the
value of λ increases, the advantage of the single queue
system in terms of response time grows.
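
The qualitative advantage of a shared queue can be reproduced with a short discrete-event sketch. The code below is not the Arena model used in this section: it assumes an open system with Poisson arrivals, three identical servers that never fail, and, for the multiple queue case, each query being routed to one fixed server chosen uniformly at random (standing in for a class-specific queue). The parameter values and function name are hypothetical.

```python
import random


def simulate(lam, mu, servers, customers, shared_queue, seed=1):
    """Mean response time under first-come first-served service."""
    rng = random.Random(seed)
    t = 0.0
    free_at = [0.0] * servers            # time each server next becomes free
    total = 0.0
    for _ in range(customers):
        t += rng.expovariate(lam)        # next arrival time
        service = rng.expovariate(mu)    # service requirement
        fixed = rng.randrange(servers)   # drawn in both modes to keep streams equal
        if shared_queue:
            # Head of the single queue takes the earliest available server.
            s = min(range(servers), key=free_at.__getitem__)
        else:
            # Query must wait for its own server, even if others are idle.
            s = fixed
        start = max(t, free_at[s])
        free_at[s] = start + service
        total += free_at[s] - t
    return total / customers


single = simulate(lam=2.4, mu=1.0, servers=3, customers=20000, shared_queue=True)
multi = simulate(lam=2.4, mu=1.0, servers=3, customers=20000, shared_queue=False)
print(single, multi)
```

Both runs use the same seed, so they see identical arrival and service sequences; the difference in mean response time comes only from the queueing discipline.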


Figure  10.3:  Expected  response  times  versus  arrival  rate 

Figure 10.4 compares the overhead of the single queue
system and the multiple queue system proposed by Wu and
Metzler as a function of failure rate. The single queue system
performs better than the multiple queue system. When the
arrival rate is 6, the single queue system outperforms the
multiple queue system by 5%. When the arrival rate is 12,
the single queue system outperforms the multiple queue
system by 6.6%. When the arrival rate is 16, the single queue
system outperforms the multiple queue system by 9.81%. As
the value of λ increases, the advantage of the
single queue system in terms of overhead grows.


Figure  10.4:  Expected  response  times  versus  arrival  rate 

10.4.  System  where  secondary  server  is  allowed  to  serve  at 
all  times 

Nominal arrival rate, λ=6

In the original system the secondary server was used
primarily to restore failed primary servers. However, it was
given the ability to serve customers under special conditions.
Now the secondary server is allowed to serve whenever it is
able. The secondary cannot serve when it is busy, it is
failed, it is repairing, or the
corresponding primary is idle. The model for this case will
be called the ‘New System’. The previously discussed
system will be called the ‘Original System’.

Figure 10.5 shows the effect of increasing the failure
rate of the closed single queue system on overhead. The
figure shows that the new system has a higher overhead than
the original system. This is an unexpected result. Increasing
the number of servers a query can choose from was
expected to result in a lower overhead. However, the closed
system only allows three queries in the system at any time.
Allowing queries to choose multiple servers causes the idle
time of each server to increase. If the idle time increases
more than the busy time for each server the overhead will
increase. The result is demonstrated in Figure 10.5. A decrease
in response time and an increase in utilization would support
this hypothesis.


Overhead as a function of Failure Rate, λ = 6


Figure  10.5:  Original  single  queue  system  overhead  and  new 
single  queue  system  overhead  versus  failure  rate 

Figure 10.6 shows the effect of increasing the restoration
rate of the closed single queue system on response time. The
trend in response time as a function of restoration rate is
as expected. The new system has a lower response time than
the original system. On average, the new system has a
9.93% lower response time. By allowing the secondary
server to serve whenever it is able, queries are served faster
and spend less time in the system.

Response Time as a function of Restoration Rate, λ = 6


Figure  10.6:  Original  single  queue  system  response  time  and 
new  single  queue  system  response  time  versus  restoration  rate 
Figure 10.7 shows the effect of increasing the restoration
rate of the closed single queue system on server utilization.
The new system shows a dramatic increase in the utilization
of a server. On average, the new system is utilized 50.54%
more than the original system. When the secondary class
server is allowed to serve queries, a query’s waiting time in
the queue is decreased. In the event of no server error, all
servers are utilized as long as one query is requesting a
different class type than the other queries. For example, if
three queries in the system are requesting class
A, A, and B data, the A queries can be served
simultaneously by SAB and SCA while the B query is served
by SBC. In the original system SCA would not provide service
and the second class A query would wait in queue.
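
The A, A, B example above can be traced with a few lines of code. This is a sketch under stated assumptions: each server name is read as primary class followed by secondary class (SAB serves A as primary and B as secondary), no server is failed or restoring, and queries are assigned greedily in arrival order. The `assign` function and its flag are hypothetical, not part of the model.

```python
# Assumed reading of the server names: (primary class, secondary class).
SERVERS = {"SAB": ("A", "B"), "SBC": ("B", "C"), "SCA": ("C", "A")}


def assign(queries, allow_secondary):
    """Greedily assign queries to free servers; None means the query waits."""
    busy = set()
    result = []
    for q in queries:
        chosen = None
        # Prefer the server whose primary class matches the query.
        for name, (primary, _secondary) in SERVERS.items():
            if primary == q and name not in busy:
                chosen = name
                break
        # New system only: fall back to a free secondary-class server.
        if chosen is None and allow_secondary:
            for name, (_primary, secondary) in SERVERS.items():
                if secondary == q and name not in busy:
                    chosen = name
                    break
        if chosen is not None:
            busy.add(chosen)
        result.append((q, chosen))
    return result


original = assign(["A", "A", "B"], allow_secondary=False)
new = assign(["A", "A", "B"], allow_secondary=True)
print(original)
print(new)
```

With no failed servers, the original rule leaves the second class A query waiting and SCA idle, while the new rule keeps all three servers busy, which is the source of the utilization gain discussed above.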


Utilization as a function of Restoration Rate, λ = 6


Figure  10.7:  Original  single  queue  system  utilization  and  new 
single  queue  system  utilization  versus  restoration  rate 

Nominal arrival rate, λ=20

Next, an increase of the query arrival rate is
implemented to further analyze the new system. Figure 10.8
shows the effect of increasing the query arrival rate on the
new system’s overhead. There is a 14% reduction in
overhead for the new system compared to when the arrival
rate was 6. However, the overhead remains greater than that
of the original system at the nominal arrival rate. When the
arrival rate was 6, servers became better utilized; however,
the limited number of queries caused an increase in server
idle time. The
result was an increase in the utilization compared to the
original system. In this case, increasing the arrival rate of
queries causes a decrease in idle time because queries are
returned to the system more rapidly. This change is
not enough to equalize the overhead value to that of the
original system.


Figure 10.8: Original single queue system overhead and new
single queue system (λ=20) overhead versus failure rate

Figure 10.9 shows the effect of increasing the query
arrival rate on the new system’s response time. A significant
increase in response time is observed. At γ = 0, there is a
0.77 time unit difference in response time. At γ = 0.1, there
is a 0.3 time unit difference in response time. This difference
is less than that observed in Figure 10.2. This indicates that the
new system improves the system response time compared to
the original system when λ = 20. The increase in response
time is due to the effect of restoration rate relative to service
rate. Queries are serviced with a rate of 12 queries/unit time.
By allowing rapid service followed by a short arrival delay,
queries pass rapidly through the system. However, in the
event of a failure, the restoration rate is only varied between
zero and 0.1, two orders of magnitude lower than parameters
μ = 12 and λ = 20. Therefore, queries that cannot be served
must wait in the queue. When λ = 6, these queries waited in
the queue for less time because they spent more time in the
delay element. When λ = 20, the queries waited for more
time in the queue because they spent less time in the delay
element. Also, as the restoration rate increases, the difference
in response time decreases because servers are restored more
quickly and queries wait in the queue for a shorter amount of
time.


Figure 10.9: Original single queue system response time (λ=6)
and new single queue system (λ=20) response time versus
restoration rate
Figure 10.10 shows the effect of increasing the query
arrival rate on the new system’s server utilization. The new
system shows a dramatic increase in the utilization of a
server. On average, the new system is utilized 53.25%
more than the original system. This is an increase in server
utilization compared to when λ = 6 for the new system. When
the secondary class server is allowed to serve queries, a
query’s waiting time in the queue is decreased. In the event
of no server errors, all servers are utilized as long as one
query is requesting a different class type than the other
queries. For example, if three queries in the system are
requesting class A, A, and B data, the A queries
can be served simultaneously by SAB and SCA while the B
query is served by SBC. In the original system SCA would not
provide service and the second class A query would wait in
queue.


Figure 10.10: Original single queue system utilization and new
single queue system (λ=20) utilization versus restoration rate

10.5. New System - Simulation

Figure 10.11 shows a comparison of the response time
of the new and original systems as a function of arrival rate.
At a low arrival rate of λ = 2 there is a 2.8 time unit
reduction in response time. At a higher arrival rate of λ = 16,
this reduction is 4.95 time units. Queries pass through more
rapidly in the new system. At low arrival rates this is not as
apparent because the traffic is not heavy. As the arrival rate
increases, more queries are present in the system and the
advantage becomes more apparent. The improvement of the
new system over the original system is not as great as that of
the original single queue system over the multiple queue system.

Figure 10.12 shows a comparison of the overhead of the
new and original systems as a function of failure rate. The
new system results in higher overhead values for all
measured arrival rates. At a low arrival rate of λ = 2 there is
a 4.4% increase in overhead. At a higher arrival rate of λ =
16, there is a decrease of 0.7%. Overhead is a measure of the
percentage of time a server spends in a failed/restoring state
with respect to the time the system spends in a
failed/restoring/serving state. Low arrival rates result in a
higher overhead for the ‘new’ system. The ‘new’ system
decreases the idle time of each server by allowing them to
serve more query types. However, the service time does not
increase as much as the idle time decreases. This results in
an increase in overhead. Queries back up in the queue when
the query arrival rate becomes greater than the query service
rate. This results in a consistently busy server regardless of
the query class being served. Thus, the overhead rates
converge to the same value.


Figure  10.11:  Simulated  expected  response  time  for  original 
system  and  new  system  as  a  function  of  arrival  rate 



Figure  10.12:  Simulated  expected  response  time  for  original  system 
and  new  system  as  a  function  of  arrival  rate 

10.6. Section summary

The  single  queue  system  has  been  shown  to  perform 
better  than  the  multiple  queue  system.  An  increase  in  the 
arrival  rate  of  queries  results  in  an  increased  advantage  of  the 
single  queue  system  over  the  multiple  queue  system  with 
respect  to  expected  response  time  and  system  overhead. 
Altering  the  single  queue  system  to  increase  service  by  the 
secondary  class  server  results  in  an  improved  response  time 
performance.  This  improvement  is  less  than  the  improvement 
of  the  original  single  queue  system  over  the  multiple  queue 
system.  Overhead  does  not  improve  considerably  because  of 
the  increase  in  the  number  of  queries  in  the  system. 
Appendix

Table  7.3  Transitions  and  transition  rates  of  the  database  unit  model  with  all  rates  valid  for  all  policies,  based  on  which  matrix 
Q  is  formed 